
1998

Open Systems:
The Common Thread

CONFERENCE PROCEEDINGS

CONFERENCE & EXHIBITION






















Proceedings 


of the 

AUUG’98 Conference 

September 16-18, 1998 


"The Common Thread" 


Sydney Hilton Hotel 
NSW, Australia 



Sponsors 

The AUUG’98 Conference gratefully acknowledges assistance from the following
organisations:

Auscom Publishing 
Compaq 
IBM Australia 
SCO 

StorageTek 
Sun Microsystems 

The AUUG’98 Conference also acknowledges assistance from The USENIX
Association for the provision of a Student Prize for the author of the paper
judged by the Program Committee to be the best paper by a full time student.

AUUG’98 Conference and Program Committee 

Stephen Boucher, MTIA 
Lucy Chubb, Softway Pty Ltd. 

Frank Crawford, ANSTO (Conference Chair) 

Liz Egan, AUUG 

Wael Foda, ACMS (Conference Organiser) 

Andrew McRae, Cisco Systems (Program Chair) 

Michael Paddon, ABA Pty Ltd. 

David Purdue 


Editor: Andrew McRae (amcrae@cisco.com) 
ISBN: 0 646 36098 1 
© AUUG’98 

This volume is published as a collective work: 
Rights to individual papers remain with 
the author or the author’s employer. 





Preface 


On behalf of the AUUG’98 Program Committee, I am delighted to present
the Conference Proceedings of this year’s AUUG National Technical Conference,
and I am sure you will be impressed by the strength of this year’s
technical program.

The theme for this year’s conference is Open Systems: The Common Thread.
Nowhere is this better reflected than in the collection of papers presented
within this volume, which ranges from theoretical studies through to commercial
examples of Open Systems at work. The three major streams for the
conference are UNIX, Security, and Internetworking.

When I think back to the first time that I used a UNIX system, some 20 years
ago (I didn’t even know what a ‘UNIX’ system was), and to the
phenomenal growth and deployment of UNIX in the ’80s and ’90s, through
to the commercial Open Systems ‘wars’, I am amazed (and pleased) that the
UNIX ‘Live Free Or Die’ spirit not only lives on, but has been revitalised
through the popularity of freely available systems such as Linux and
FreeBSD. We are pleased to have representatives of organisations that promote
and distribute these systems come and speak to us about their experiences
and direction.

It has been said that the Internet is changing the way that we work, play and
live. It is truly a communications revolution, and an exciting time for us all.
This is reflected in the number of technical papers that are directly or indirectly
related to the Internet; we are pleased to welcome contributions from
the Australian Chapter of the Internet Society. AUUG historically was the
forum where people met and ‘networked’ about networking, and I am
pleased to see that we still have a close co-operative relationship with organisations
that AUUG has had some hand in planting.

UNIX has traditionally had a strong security focus, and we are
delighted to welcome Robert Morris as a keynote speaker; Bob Morris was
involved right from the early UNIX days in developing security software for
UNIX. I am very pleased to see that we have continued to attract strong security
and encryption related technical papers, so that we can provide comprehensive
coverage of new developments in this field.

We are also pleased to have other significant speakers join us who can bring
new challenges or insight on issues in the Open Systems world. Richard
Stallman is a prime example of the belief that one person can make a difference,
with his founding and development of the GNU project, and the promulgation
of the free software philosophy.

Open Systems is truly the common thread knitting these themes together. 
Enjoy! 

Andrew McRae (AUUG’98 Program Chair) 



Contents 

101: File Systems and Storage

Storage Appliance Architecture
Alex Miroshnichenko

The Vinum Volume Manager
Greg Lehey

102: Change Control & Management

Scripts to Manage Change
Julie Jester & Stephanie Nicholls

Experience Using CVS for Long-Running Projects
Peter Chubb

AutoInstall: Automating Platform Installation
Gordon Rowell & Peter Bray

201: Security - Encryption

A Current Perspective on Encryption Algorithms
Lawrie Brown

Which One is for You: Issues of Trust with Public Key Certificates
Tinan Yang, Lawrie Brown & Jan Newmarch

Torn Money and the PGP Web of Trust
Jeanette McLeod & Greg Rose

202: Security - Technologies

Unified Authentication Using PAM
Anthony Baxter

IPsec Encryption
Phillip Yialeloglou

203: Network Security

A SESAME Linux Environment
Bradley Broom & Paul Ashley

Intranet Security technologies - Sesame or SSL?
Paul Ashley, Joris Claessens, Gary Gaskell & Mark Vandenwauver

Remote Operating System Identification
Anthony Osborne

204: UNIX Architecture

Unix Systems Programming Using Java
Jan Newmarch

HP-UX 11.0 and futures
John Knaggs

301: Network Technologies

Optical Networking
Harry Dutton

Building High Volume Distributed Applications using CORBA
Saul Cunningham

SAMBA
John Terpstra

302: ISOC-AU

The Domain Name System: Engineering vs Economics
Kate Lance

Quality of Service in the Internet: Fact, Fiction, or Compromise?
Geoff Huston & Paul Ferguson

MARSHNet: A Multiservice Asynchronous Transfer Mode Wide-Area Network
Shaun W. Amy & James D. Argyros

303: Network Management

Implementation of a User-Pays Library Printing System: a Case Study
Kay Darbyshire & Richard Jacewicz

An Architecture for Remote Network Management using the RMON MIB and Programmable Agents
Bradley Williamson & Craig Farrell

Who’s Sucking My Data? - a 1970’s Algorithm Rescues a 1990’s Problem
George Michaelson

304: Clustering

Using clustered Linux PCs for parallel processing
Robert Hart

Service Failover by Dynamic DNS Updates
Peter Gray

PARMON: A Comprehensive Cluster Monitoring System
Rajkumar Buyya








Storage Appliance Architecture. 


Alex Miroshnichenko 
VERITAS Software Corporation, 
Mountain View, CA, USA. 
alex@veritas.com 


It is hard to overestimate the importance of data storage management for modern
information technology. 10 years ago Think magazine observed: "Without direct access
to large volumes of business data, the computer industry today would not only be
unrecognizable, but probably insignificant. With magnetic disk storage, everything has
been possible, from moon missions to home computers."

Since then the data storage industry has experienced explosive growth. Server systems with
terabyte disk farms are becoming quite common. In a typical commercial server
configuration data storage accounts for more than 90% of total system cost. At the same
time the cost of managing the storage is growing at an alarming rate and is often
perceived as one of the main growth-limiting factors.


Enterprise Storage Requirements. 

The focus of enterprise data management is shifting from host and server to specialized
storage subsystems. The concept of redundant arrays of inexpensive disks (RAID) was
introduced in 1988 [1]. Since then storage hardware vendors have sold billions of dollars
worth of disk arrays. The majority of array products implement various levels of
hardware redundancy, balancing it with performance requirements and costs. The industry
standard RAID level definitions provide common metrics for comparison of disk array
products [2,3]. A detailed review of the current array market is beyond the scope of this
paper; suffice it to say that there are about 150 vendors playing in this field, with total
revenues of about US$15.9 billion in 1997. More than US$17 billion is projected for the
year 2000 [4].

However, the vast majority of modern disk array vendors provide little more than
enclosures and cabinets for housing numerous individual disks. Granted, such enclosures
can be quite sophisticated: they may include redundant power supplies, hot-swapping
capabilities for all major hardware components including the disks themselves, even
redundant data connection ports with automatic fail-over. The current disk array
architectures solve the hardware reliability problems but provide very little towards
storage administration. Technical and marketing discussions of the RAID and array
technology are usually centered on various RAID levels and degrees of hardware
redundancy, but rarely mention the ease of administration and system maintenance.
Performance discussions are typically limited to the types of standard hardware
interconnects and maximum number of connection ports.





Information technology (IT) demands more sophistication from storage systems
than most of them can provide. The modern enterprise depends critically on its data and
its ability to manage it efficiently. IT departments demand storage that acts as a
strategic element of an IT structure; they reject the notion of storage as an isolated CPU
add-on or peripheral. Beyond simply holding information, this storage must allow
companies to manage, protect, provide access to and efficiently plan the growth of
enormous amounts of information previously dispersed on multiple servers. The
individual storage subsystems and storage arrays must therefore support additional
functionality to meet those requirements. Let's enumerate and analyze these requirements:

• Ease of administration: given the size of the storage systems involved it makes little
sense to administer the storage arrays as a collection of distinct individual disks. The
host system as well as system administrators should be presented with storage objects,
with all the details of mapping such objects to the actual physical spindles hidden.
The storage subsystem presents a set of virtual storage objects to the outside world;
the administration interface should allow easy creation, deletion and modification of
such virtual objects regardless of the physical content.

• Shared storage access: multiple host computer systems must be able to use the
storage, and multiple independent physical connections to the storage must be provided;
i.e. a storage subsystem must be partitionable for use by multiple host computers.

• Shared data access: multiple host systems should be able to access the same virtual
storage object; such hosts may have different architectures (heterogeneous access).

• Automatic backup: regardless of how reliable a storage subsystem may be, it still
must be backed up to an independent medium. Such backups must be performed with
the minimum possible impact on normal operation. Ideally the backups are done
without any intervention of the host computer systems, while the host system actively
accesses the stored data.

• Logical storage protection: backups normally protect against physical failures;
however, logical failures (operator errors and software bugs) are known to be
the major source of data loss and computer system downtime. A storage subsystem
should be able to create and maintain historical data images in order to facilitate quick
logical recovery.

• Scalability: no matter how big a box is, at some point in the future the storage
demands will exceed its capacity. In addition, the number of host systems that
need to be connected to the storage subsystem may exceed the number of
connection ports in the initial configuration. The storage architecture must provide
an easy capacity and connectivity growth path which does not require replacing one
box with a bigger box. Performance scalability is also an issue that needs to be
addressed.

• Remote data replication and mirroring: modern enterprises are geographically
dispersed. For both disaster recovery and locality of data access purposes the data
must be replicated to multiple physical locations. IT managers are increasingly
viewing such replication as an integral part of a storage subsystem's functionality.

• Multiple access methods: in enterprise environments a storage subsystem must
support the maximum possible number of data access protocols and standard
interconnects. Standard device protocols like SCSI and Fibre Channel (which may be
considered a variant of SCSI) must be supported; however, it may also be desirable to
provide file level access over the standard network using NFS and/or CIFS protocols.

This list is not necessarily complete; however, even as it stands it is quite clear that
existing disk arrays cannot meet all the above requirements. Some of the high-end arrays
are beginning to address the logical partitioning issues. No array vendor offers a complete
set of these features, with the notable exception of the EMC Corporation [5]. Its
Symmetrix™ product line offers all kinds of sophisticated storage services including
logical data protection and remote replication over dedicated network links. However, it
comes at a price: US$700,000 is a typical cost for a full-featured Symmetrix™ array.
Besides, it also faces scalability problems: once the capacity of a single box has been
reached it is very difficult to extend the same set of integrated storage services to
multiple arrays.

The apparent failure of the array vendors to meet the market requirements for mass
storage systems can be explained by the complexity of the problems. Modern disk arrays
are in fact computer systems, even though the hardware vendors do not emphasize this
fact (many salesmen do not even realize it!). They mostly employ real-time operating
system (RTOS) kernels to implement low-level disk control algorithms. It is increasingly
difficult and costly to implement the sophisticated features described above in a
traditional RTOS. The market leader EMC Corporation has invested huge amounts of
money in its proprietary real-time operating system software, which in fact provides most
of the unique value of the Symmetrix™ arrays. Vendors without deep pockets find
it almost impossible to follow suit.


An Open Systems Solution. 

The proposed solution is simple and almost obvious to anybody who has studied the
developments in the computer industry over the last 20 years (which includes essentially
every information technology professional today): use commodity, general purpose
hardware and software to build specialized storage subsystems.

A careful look at the history of file servers shows that the battle between general
purpose and specialized solutions has been going on for quite a while. AUSPEX has been
successful in developing a special purpose hardware and proprietary software solution to
implement a high performance NFS file server. However, as general purpose
computer systems caught up in performance, AUSPEX has found it increasingly difficult
to maintain its proprietary design. At the time of this writing, general purpose Ultra
Enterprise computers from Sun Microsystems can provide essentially the same level of
NFS server performance at a lower cost.

Network Appliance Corporation [6] products are another successful example of a
specialized file server solution. NetApps used commodity hardware components and a
proprietary operating system to build a highly successful product. The DataOnTap™
operating system has designed-in support for a number of advanced storage management
functions, including a logical data snapshot capability which provides logical data
protection and "time travel". But here again, the market demands for high availability and
remote data replication present a serious problem: even a simple failover solution for
NetApps servers requires the use of extremely expensive specialized hardware for NVRAM
mirroring.

The storage management and high availability software for general purpose operating
systems has made considerable progress in the last few years. Most modern
systems now include virtual disk and volume managers, which greatly reduce the
complexity of storage administration and are often used as virtual disk arrays. Log-based
high performance file systems minimize system recovery time and greatly improve I/O
performance. Failover solutions and software data replication make the task of
building highly available configurations out of commodity components practical (if not
simple). I have seen a number of cases where a commodity based configuration built
with Sun Microsystems servers was preferred to NetApps for a highly available file
server solution.

It is rather obvious that general-purpose computer systems running general-purpose
open operating systems like modern variants of UNIX are competitive with specialized
offerings and in fact have the advantage of being more flexible. Open systems can more
easily benefit from the economies of scale in software development and they adapt faster
to ever-increasing market demands.

By the same logic open systems can be utilized to build the specialized storage 
subsystems like disk arrays. 

The statement above often draws a set of objections, which usually claim that any
software that controls a disk array must be based on an RTOS. However, upon careful
examination of modern I/O architectures it is hard to support these claims. With the
exception of some niche markets like video-on-demand, none of the modern data
processing applications running on server systems has any kind of real time response
requirement. In fact the I/O architecture of a modern UNIX OS is not capable of real
time operation in the strict sense. The I/O requests are typically queued for processing by
I/O drivers, and the service time, while predictable, is not guaranteed.

Modern UNIX operating systems like Solaris™ from Sun Microsystems have made considerable
progress in the area of performance, I/O performance in particular. The Solaris™ architecture
includes kernel threads, which allow very efficient server implementations with
minimum overhead for context switching. I/O subsystems make wide use of direct I/O,
which minimizes data buffer copying.

A sample architecture for interconnecting several host systems to a storage subsystem is
illustrated in Figure 1. The storage controllers are general-purpose computers running
Solaris OS; such computers may be either x86 PC or Sparc based. Both front-end and
back-end storage interconnects can be either Fibre Channel or SCSI; however, it is practical
to assume that the front-end is more likely to be Fibre Channel than the back-end. In order
to provide scalability and fault tolerance the storage controllers must be configured as
failover configurations. Figure 1 shows a pair of storage controllers; however, given
the right level and sophistication of the failover software this number can be increased.
The storage and the controllers themselves need to be administered; it is practical to
implement the administration interface via a standard TCP/IP network. It is assumed that
the issues in managing such networks are well understood and are beyond the scope of
this paper.



[Diagram labels: three application servers (HOST); I/O Front End interconnect
(Fibre Channel); interconnect (SCSI); TCP/IP Network (client connections and
administration)]

Figure 1





















Storage Appliance. 

A storage appliance is a collection of UNIX OS software necessary to implement the full
functionality of the storage controller as described above. A version of the storage
appliance using components from VERITAS Software Corporation [7] is shown in
Figure 2.

The storage appliance presents regular files created in the name space inside the Solaris OS
as virtual devices on the external SCSI and/or Fibre Channel connection. This is done with
the help of the Target Mode Driver and the Target Mode Server; these are the only truly new
components that needed to be designed from scratch. The idea of a target mode driver is
best understood through the example of SCSI. Typically any computer system contains a SCSI
Host Bus Adapter (HBA) which operates in initiator mode; i.e. it initiates SCSI
commands to the SCSI targets (usually disks and such). However, the majority of HBAs
use the same SCSI processor chips as the disk drive controllers; they are only
programmed differently. It is relatively easy to reprogram most HBAs to operate as a
target, which responds to SCSI commands from other initiators. The target mode driver is
HBA specific; multiple instances may be present, one for each HBA in the system.
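
As a rough illustration of how a regular file can stand in for a disk on the external
interconnect, the sketch below maps logical block addresses onto byte offsets in a backing
file. The class name FileBackedLun, its methods and the 512-byte block size are assumptions
made for this example only; the prototype's target mode code runs inside the Solaris kernel
and is not reproduced here.

```python
BLOCK_SIZE = 512  # assumed logical block size in bytes


class FileBackedLun:
    """Hypothetical virtual disk whose blocks live in one regular file."""

    def __init__(self, path, num_blocks):
        self.num_blocks = num_blocks
        # Create a fresh backing file and extend it to the full device size.
        self.backing = open(path, "w+b")
        self.backing.truncate(num_blocks * BLOCK_SIZE)

    def _check_range(self, lba, count):
        if lba < 0 or count < 0 or lba + count > self.num_blocks:
            raise ValueError("block range lies outside the virtual device")

    def read(self, lba, count):
        """Serve a block read: block address -> byte offset in the file."""
        self._check_range(lba, count)
        self.backing.seek(lba * BLOCK_SIZE)
        return self.backing.read(count * BLOCK_SIZE)

    def write(self, lba, data):
        """Serve a write of whole blocks starting at block address lba."""
        if len(data) % BLOCK_SIZE:
            raise ValueError("data must be a whole number of blocks")
        self._check_range(lba, len(data) // BLOCK_SIZE)
        self.backing.seek(lba * BLOCK_SIZE)
        self.backing.write(data)
        self.backing.flush()

    def close(self):
        self.backing.close()
```

An initiator that asks for block 3 of such a device would simply receive the 512 bytes
stored at offset 1536 of the backing file.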



Figure 2 


The target mode server is a hardware-independent kernel module, which interfaces to both
the target mode driver and the local I/O subsystem. Its function is somewhat similar to the
kernel portion of the NFS server code as implemented in Solaris. It receives I/O requests
from the drivers, performs the necessary queuing, buffering and scheduling, and sends the
requests to the I/O subsystem - in this case to the VERITAS File System (VxFS) with Quick
I/O™ and VERITAS Volume Manager (VxVM).
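
The receive, queue and dispatch flow described above can be sketched in user space as
follows. This is only an analogy under stated assumptions: the names MemoryBackend, Request
and TargetModeServerSketch are hypothetical, a single worker thread services requests in
strict arrival order, and an in-memory byte array stands in for VxFS and VxVM. The real
target mode server is a kernel module with its own buffering and scheduling.

```python
import queue
import threading


class MemoryBackend:
    """Stand-in for the local I/O subsystem: a flat array of bytes."""

    def __init__(self, size):
        self.data = bytearray(size)

    def read(self, offset, length):
        return bytes(self.data[offset:offset + length])

    def write(self, offset, payload):
        self.data[offset:offset + len(payload)] = payload


class Request:
    """One I/O request handed over by a (hypothetical) target mode driver."""

    def __init__(self, op, offset, arg):
        self.op = op          # "read" or "write"
        self.offset = offset
        self.arg = arg        # length for reads, bytes for writes
        self.result = None
        self.done = threading.Event()


class TargetModeServerSketch:
    def __init__(self, backend):
        self.backend = backend
        self.pending = queue.Queue()  # FIFO queue of outstanding requests
        self.worker = threading.Thread(target=self._serve, daemon=True)
        self.worker.start()

    def submit(self, op, offset, arg):
        """Queue a request and return it so the caller can wait on it."""
        request = Request(op, offset, arg)
        self.pending.put(request)
        return request

    def _serve(self):
        # Take requests in arrival order and pass them to the backend.
        while True:
            request = self.pending.get()
            if request is None:  # shutdown marker
                return
            if request.op == "read":
                request.result = self.backend.read(request.offset, request.arg)
            else:
                self.backend.write(request.offset, request.arg)
            request.done.set()

    def shutdown(self):
        self.pending.put(None)
        self.worker.join()
```

Because the submitter only waits on the completion event, several requests can be
outstanding at once, which is the property the appliance needs from its I/O path.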

VxVM implements the low level disk management; it provides disk mirroring and RAID
functionality. The VERITAS File System (VxFS) is a high performance file system which also
supports a number of important advanced features for data management:

• Fileset clones: VxFS can create and maintain multiple historical images of user data 
and file system metadata. 

• Online backup: 

• Incremental backup at the block level: only the data blocks that have actually changed
are backed up.

These features can be easily used to implement the requirements for the advanced storage 
system as described in the first section of this paper. 
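
To make the idea of block-level incremental backup concrete, here is a minimal sketch that
records which blocks have been written since the last backup and copies only those. The
names TrackedVolume and restore, the 4096-byte block size and the use of an in-memory list
are assumptions for illustration; the paper does not describe how VxFS tracks changed
blocks internally.

```python
BLOCK_SIZE = 4096  # assumed block size in bytes


class TrackedVolume:
    """Hypothetical volume that remembers which blocks have changed."""

    def __init__(self, num_blocks):
        self.blocks = [bytes(BLOCK_SIZE) for _ in range(num_blocks)]
        self.dirty = set()  # indices written since the last backup

    def write_block(self, index, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        self.blocks[index] = bytes(data)
        self.dirty.add(index)

    def incremental_backup(self):
        """Return {index: data} for changed blocks and reset the tracking."""
        changed = {i: self.blocks[i] for i in sorted(self.dirty)}
        self.dirty.clear()
        return changed


def restore(full_image, increments):
    """Rebuild a volume image from a full copy plus ordered increments."""
    image = list(full_image)
    for increment in increments:
        for index, data in increment.items():
            image[index] = data
    return image
```

A volume with two modified blocks out of thousands therefore produces an increment of just
two blocks, and applying the increments in order to the last full image reproduces the
current contents.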

VERITAS Quick I/O™ technology was developed to optimize the operation of relational
database systems (like Oracle™) when used with VxFS. The design and implementation
of the technology are described in [9]. It turns out that the problems solved during the
development of Quick I/O™ are very close, and in many cases identical, to those of storage
device impersonation on the external storage interconnect as required by the storage appliance.

Quick I/O™ supports multiple outstanding I/O requests to the same file, can be used in
“write-through/read cache” mode, and supports direct and asynchronous I/O. All these features
are also required for an efficient implementation of device access.

A standard OS environment makes it easy to integrate other important functions. High
availability can be achieved by building a failover cluster with standard HA software;
integrated backup is done by using any of the well-known backup and restore utilities.

Current Status and Preliminary Results. 

We have built a prototype of the storage appliance software package as described above.
The prototype exists for both Sparc and x86 versions of Solaris 2.6.
For the SCSI target mode HBA we have used the Performance Technologies PTI450-D
(UltraWide Differential SCSI PCI HBA). At the time of this writing, work is under way
to implement target mode drivers for the Performance Technologies Fibre Channel HBA and
for the JNI Fibre Channel HBA. The first version of the target mode server code has been
implemented and we have run a number of preliminary benchmarks to measure the
quality of the implementation.

We used a database load simulation benchmark and measured CPU utilization in the
appliance against throughput. Figure 3 illustrates the results.
CPU utilization is shown on the X-axis as a percentage, as reported by the sar(1) utility.
Note that all times are system times, as the code has no user space components. The Y-axis
shows the throughput as a percentage of the maximum theoretical bandwidth of the
internal system bus in the appliance. The last data point (after 80% CPU utilization) is not
valid as the results are not repeatable; we can only say that the throughput goes down
substantially.

We used a DELL 233 MHz Pentium II server with 512 Mbytes of RAM, one PCI bus and
32 GB of striped disks (8 spindles of 4 GB each), running Solaris 2.6
for Intel. As we had only one PCI bus, the theoretical maximum bandwidth was
considered to be half of the PCI spec (100 MB/sec). For a system with multiple PCI buses
the maximum bandwidth would be equal to that of the single PCI bus spec. Obviously, one
can construct elaborate configurations by striping data access across multiple target
HBAs, multiple disk controllers and multiple internal buses. We are currently running
these experiments with a Sun Ultra 450 (4 PCI buses).

These results are very encouraging: they indicate that Solaris OS can easily reach over
80% of the maximum theoretical hardware bandwidth using less than 25% of the CPU.
Therefore the limit to the performance of the storage appliance is not the software but the
hardware, and no RTOS implementation is going to perform any better than an open
system.

























Conclusions. 


We are continuing to study the prototype but even at this stage it is obvious that we can 
build a marketable product. The next step is to bring our Fibre Channel target drivers into 
the picture and rerun the benchmarks. We also would like to try open operating systems 
other than Solaris, most importantly FreeBSD and possibly Linux. We would encourage 
other members of the Linux and FreeBSD development and user community to do the 
same. 

On the whole the advanced storage subsystems seem to be yet another area of computing 
where open software can make a difference and substantially bring down the costs. 

Acknowledgements. 

The author would like to thank all members of the VERITAS team who are working with
him on this project, in particular Tony Lukes, whose professionalism, experience and
attention to detail made this project a reality.

References: 

[1] David A. Patterson, Garth Gibson, Randy H. Katz. "A Case for Redundant Arrays of 
Inexpensive Disks". University of California Berkeley, 1988. 

[2] The RAID Book: A Storage System Technology Handbook, 6th Edition, RAID Advisory
Board, 1997.

[3] RAID Advisory Board, www.raid-advisory.com

[4] Gode Davis. "Is Our Data Safe Yet ?", SunExpert, Vol. 9, No. 5, May 1998. 

[5] EMC Corporation, www.emc.com 

[6] Network Appliance Corporation, www.netapps.com 

[7] Veritas Software Corporation www.veritas.com 

[8] Gregory Pfister. "In Search of Clusters: The Ongoing Battle in the Lowly Parallel 
Computing". Second edition. 1998. 

[9] Alex Miroshnichenko. "Beyond mkfs: Extending File Systems to Address the Needs of 
Large Scale Mission-Critical Applications". AUUG97 Brisbane, 1997. 


About the author. 

Alex Miroshnichenko is a Director of Software Engineering at VERITAS Software 
Corporation in Mountain View, CA. He has been with VERITAS for almost 6 years and 
spent most of that time developing Veritas File System technology and looking for 
various ways to apply it beyond the traditional areas. Currently he is responsible for new 
technology development in the area of intelligent storage subsystems and storage area 
networks. Alex has been working on various aspects of operating systems designs and 
data storage management for over 13 years. He can be reached at alex@veritas.com 













































































The Vinum Volume Manager 

Greg Lehey 

Nan Yang Computer Services Ltd. 

PO Box 460 
Echunga SA 5153 
grog@lemis.com 
08-8388-8286 


ABSTRACT 

This paper describes the Vinum Volume Manager, a block device driver which implements
virtual disk drives. It isolates disk hardware from the block device interface and
maps data in ways which result in an increase in flexibility, performance and reliability
compared to the traditional slice view of disk storage. Vinum implements the RAID-0,
RAID-1 and RAID-5 models, both individually and in combination.

Introduction 

Despite the rapid evolution of disk hardware, the current UNIX disk abstraction is inadequate
for a number of modern applications. In particular, file systems must be stored on a
single disk partition (or volume), and there is no kernel support for redundant data storage.
In addition, the direct relationship between disk volumes and their location on disk
makes it generally impossible to enlarge a disk volume once it has been created. Performance
can often be limited by the maximum data rate which can be achieved with the
disk hardware.

The largest modern disks store only about 30 GB, but large installations routinely have
more than a terabyte of disk storage, and it is not uncommon to see disk storage of several
hundred gigabytes even on PCs. Storage-intensive applications such as Internet World-Wide
Web and FTP servers have accelerated the demand for high-performance, high-volume,
reliable storage systems.

The current trend is to realize such systems in disk array hardware, which looks like a 
very large disk to the host system, but which spreads the data over a number of disks, 
possibly in a redundant fashion such as RAID-1 or RAID-5. Disk arrays have a number 
of advantages: 





• They are portable. Since they have a standard interface, usually SCSI, but sometimes
IDE, they can be installed on almost any system without kernel modifications.

• They can offer impressive performance: they offload the calculations (in particular, the
parity calculations for RAID-5) to the array, and in the case of replicated data, the
aggregate transfer rate to the array is less than it would be to local disks. RAID-0
(striping) and RAID-5 organizations also spread the load more evenly over the physical
disks, thus improving performance. Nevertheless, an array is typically connected via
a single SCSI connection, which can be a bottleneck.

• They are reliable. A good disk array offers a large number of features designed to
enhance reliability, including enhanced cooling, hot-plugging (the ability to replace a
drive while the array is running) and automatic failure recovery.

On the other hand, disk arrays are relatively expensive and not particularly flexible. An
alternative is a software-based volume manager which performs similar functions in
software. A number of these systems exist, notably the VERITAS® volume manager [Veritas],
Solaris DiskSuite [Solstice], IBM’s Logical Volume Facility [IBM] and SCO’s Virtual
Disk Manager [SCO]. An implementation of RAID software is also available for Linux
[Linux].

Vinum 

Vinum is an open source [OpenSource] volume manager implemented under FreeBSD 
[FreeBSD]. A number of features distinguish it from commercial volume managers: 

• Vinum implements RAID-0 (striping), RAID-1 (mirroring) and RAID-5 (rotated
block-interleaved parity). In RAID-5, a group of disks are protected against the
failure of any one disk by an additional disk with block checksums of the other disks. 1

• Drive layouts can be combined to increase robustness, including striped mirrors
(so-called “RAID-10”).

• Vinum implements only those features which appear useful. Some commercial volume
managers appear to have been implemented with the goal of maximizing the size
of the spec sheet. Vinum does not implement “ballast” features such as RAID-4,
although it would have been trivial to do so.

• Volume managers initially emphasized reliability and performance rather than ease of
use. The results are frequently down time due to misconfiguration, with consequent
reluctance on the part of operational personnel to attempt to use the more unusual
features of the product. Vinum attempts to provide an easier-to-use non-GUI interface.


1. The RAID-5 functionality is currently available under license from Cybernet, Inc. [Cybernet]. It will
be released as open source at a later date.


Concepts 

As used in this document, a volume manager is a software component which isolates file
systems from the underlying disk hardware. Instead of building file systems on disk
partitions, they are built on logical disks, called volumes. This has a number of advantages:

• Volumes may span disk drives. 

• Volumes may be larger than any drive. 

• By spreading the disk load over multiple volumes, it is possible to improve
performance.

• By replicating data within the volume, it is possible to improve availability. 

• By changing the volume configuration, it is possible to reorganize disk storage
on-line.

• It is possible to extend the size of volumes. 

To achieve these results, Vinum defines a hierarchy of four logical objects: 

• A volume is a logical disk. From a user viewpoint, it is almost indistinguishable from 
a disk partition, the conventional representation of a logical disk. A volume contains 
one or more plexes. 

• A plex is a representation of the data in a volume. Each plex has an address space the 
same size as the size of the volume, though it is not required that the address space be 
completely mapped to disk storage. Each plex represents a (possibly incomplete) 
copy of the data in the volume, thus providing protection against disk failure. This 
represents an implementation of RAID-1. 

• Each plex contains one or more subdisks. A subdisk is a contiguous segment of disk 
storage. Conceptually, it is similar to a disk slice, but the implementation is different. 
In particular, a disk slice contains metadata such as labels, while a subdisk does not. 
A plex can be extended in length by adding subdisks to it: since subdisks can be
located on any device under Vinum control, there is no requirement for contiguous free
space in order to expand a plex.

• A drive is a hardware component which may contain subdisks. From an implementation
viewpoint, it may be a complete device or a disk slice. Vinum does not depend
on any particular disk hardware, though it is intended for use primarily on
conventional hard disks.


Plex organization 

Subdisks may be mapped to plexes in one of three different ways. The following figures 
illustrate the possible ways of mapping blocks of 4096 (0x1000) bytes in a plex with 
four subdisks. 

• A concatenated plex maps the subdisks to the plex sequentially. The plex maps to the
complete address space of each subdisk in turn. In a concatenated plex, subdisks do
not need to be of equal size.

Offset    Subdisk 1   Subdisk 2   Subdisk 3   Subdisk 4

0x0000    0x0000      0x6000      0xc000      0x12000
0x1000    0x1000      0x7000      0xd000      0x13000
0x2000    0x2000      0x8000      0xe000      0x14000
0x3000    0x3000      0x9000      0xf000      0x15000
0x4000    0x4000      0xa000      0x10000     0x16000
0x5000    0x5000      0xb000      0x11000     0x17000

Figure 1: Concatenated plex



The dotted lines in this figure represent the logical blocks in order to illustrate the
differences from the other organizations.

• A striped plex maps the plex address space to equal-sized blocks of each subdisk in 
turn. As a result, the subdisks in a plex must all have the same size: 





























[Figure: columns Offset, Subdisk 1, Subdisk 2, Subdisk 3, Subdisk 4; rows at offsets
0x0000 to 0x5000. The block addresses in each cell did not survive reproduction.]

Figure 2: Striped plex


• As implemented by Vinum, a RAID-5 plex is similar to a striped plex, except that it
implements RAID-5 by including a parity block in each stripe. As required by
RAID-5, the location of this parity block changes from one stripe to the next:


[Figure: columns Offset, Subdisk 1, Subdisk 2, Subdisk 3, Subdisk 4; rows at offsets
0x0000 to 0x5000. The block addresses in each cell did not survive reproduction.]

Figure 3: RAID-5 plex
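
The concatenated and striped mappings described above can be sketched in a few lines of
Python. This is an illustration only, not Vinum's kernel code: the function names
concat_map and striped_map are invented for the example, and subdisks are numbered from 0
here rather than from 1 as in the figures.

```python
def concat_map(offset, subdisk_sizes):
    """Map a plex offset to (subdisk index, offset within subdisk), concatenated plex."""
    for index, size in enumerate(subdisk_sizes):
        if offset < size:
            return index, offset
        offset -= size          # skip the whole address space of this subdisk
    raise ValueError("offset beyond end of plex")


def striped_map(offset, subdisk_count, stripe_size):
    """Map a plex offset to (subdisk index, offset within subdisk), striped plex."""
    stripe, within = divmod(offset, stripe_size)    # which stripe block, and where in it
    row, index = divmod(stripe, subdisk_count)      # stripe blocks rotate over the subdisks
    return index, row * stripe_size + within
```

With four subdisks of 0x6000 bytes each, as in Figure 1, concat_map(0xd000, [0x6000] * 4)
yields index 2 (Subdisk 3) at offset 0x1000, which agrees with the figure.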


Which plex organization? 

Vinum implements only that subset of RAID organizations which make sense in the 
framework of the implementation. It would have been possible to implement all RAID 
levels, but there was no reason to do so. Each of the chosen organizations has unique 
advantages: 













• Concatenated plexes are the most flexible: they can contain any number of subdisks, 
and the subdisks may be of different length. The plex may be extended by adding 
additional subdisks. They require less CPU time than striped or RAID-5 plexes, 
though the difference in CPU overhead from striped plexes is not significant. On the 
other hand, they are most susceptible to hot spots, where one disk is very active and 
others are idle. 

• The greatest advantage of striped (RAID-0) plexes is that they reduce hot spots: by 
choosing an optimum sized stripe (empirically determined to be in the order of 256 
kB), the load on the component drives can be made more even. The disadvantages of 
this approach are (fractionally) more complex code and restrictions on subdisks: they 
must be all the same size, and extending a plex by adding new subdisks is so 
complicated that Vinum does not implement it. An additional, trivial restriction is that 
a striped plex must have at least two subdisks, since otherwise it is indistinguishable 
from a concatenated plex. 

• The implementation of RAID-5 plexes stretches the concept of the volume manager 
somewhat. While RAID-1 is implemented at the volume level, RAID-5 is 
implemented at the plex level. As implemented, RAID-5 plexes are effectively an 
extension of striped plexes. Compared to striped plexes, they offer the advantage of 
fault tolerance, but the disadvantages of higher storage cost and significantly higher 
CPU overhead, particularly for writes. The code is an order of magnitude more 
complex than for concatenated and striped plexes. Like striped plexes, RAID-5 plexes 
must have equal-sized subdisks and cannot be extended. Vinum enforces a minimum 
of three subdisks for a RAID-5 plex, since any smaller number would not make any 
sense. 

The following table gives an overview of the advantages and disadvantages of each plex
organization.


Plex type      Minimum    Can add    Must be      Application
               subdisks   subdisks   equal size

concatenated   1          yes        no           Large, non-redundant data storage
striped        2          no         yes          Highly concurrent access
RAID-5         3          no         yes          Highly reliable storage,
                                                  primarily read access


Figure 4: Vinum plex organizations
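
The constraints in Figure 4 can be expressed as a small validity check. The sketch below
is illustrative Python, not part of Vinum; the names PLEX_RULES and check_plex are
hypothetical, and the organization keywords are borrowed from the configuration files
shown later in this paper.

```python
# Rules taken from Figure 4: minimum subdisk count and whether sizes must match.
PLEX_RULES = {
    "concat":  {"min_subdisks": 1, "equal_size": False},
    "striped": {"min_subdisks": 2, "equal_size": True},
    "raid5":   {"min_subdisks": 3, "equal_size": True},
}


def check_plex(org, subdisk_sizes):
    """Return a list of rule violations (empty if the plex layout is acceptable)."""
    rules = PLEX_RULES[org]
    problems = []
    if len(subdisk_sizes) < rules["min_subdisks"]:
        problems.append(f"{org} plex needs at least {rules['min_subdisks']} subdisks")
    if rules["equal_size"] and len(set(subdisk_sizes)) > 1:
        problems.append(f"{org} plex needs equal-sized subdisks")
    return problems
```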






These are not the only possible organizations. In addition, the following could have been 
implemented: 

• RAID-4, which differs from RAID-5 only by the fact that all parity data is stored on a
specific disk. This simplifies the algorithms somewhat at the expense of drive
utilization: the activity on the parity disk is a direct function of the read to write ratio.
Since Vinum implements RAID-5, RAID-4’s only advantage is nullified.

• RAID-3, effectively an implementation of RAID-4 with a stripe size of one byte. 
Each transfer requires reading each disk (with the exception of the parity disk for 
reads). Without spindle synchronization (where the corresponding sectors pass the 
heads of each drive at the same time), RAID-3 would be very inefficient. In a 
multiple-access system, it also causes high latency. 

• RAID-2, which uses two subdisks to store a Hamming code, and which otherwise 
resembles RAID-3. Compared to RAID-3, it offers a lower data density, higher CPU 
usage and no compensating advantages. 

In addition, RAID-5 can be interpreted in two different ways: the data can be striped, as
in the Vinum implementation, or it can be written serially, exhausting the address space
of one subdisk before starting on the other, effectively a modified concatenated
organization. There is no recognizable advantage to this approach, since it does not
provide any of the other advantages of concatenation.

Configuring Vinum 

Vinum maintains a configuration database which describes the objects known to an
individual system. Initially, the user creates the configuration database from one or more
configuration files with the aid of the vinum(8) utility program. Vinum stores a copy of
its configuration database on each disk slice (which Vinum calls a device) under its
control. This database is updated on each state change, so that a restart accurately
restores the state of each Vinum object.

The configuration file 

The configuration file describes individual Vinum objects. The definition of a simple
volume might be:


drive a device /dev/sd0h
drive b device /dev/sd1h
drive c device /dev/sd2h
drive d device /dev/sd3h
volume myvol
  plex org concat
    sd length 512m drive a
    sd length 512m drive b
  plex org concat
    sd length 512m drive c
    sd length 512m drive d


This file describes a total of 11 Vinum objects:

• The drive lines describe four disk partitions (drives) and their location relative to the
underlying hardware. They are given the symbolic names a, b, c and d.

• The volume line describes a volume. The only required attribute is the name, in this 
case myvol. 

• The plex lines define plexes. The only required parameter is the organization, in this
case concat. No name is necessary: the system automatically generates a name
from the volume name by adding the suffix .px, where x is the number of the plex in
the volume. Thus the first plex will be called myvol.p0 and the second will be called
myvol.p1.

• The sd lines describe subdisks. The minimum specifications are the name of a drive
on which to store it, and the length of the subdisk. As with plexes, no name is
necessary: the system automatically assigns names derived from the plex name by
adding the suffix .sx, where x is the number of the subdisk in the plex. Thus Vinum
gives these four subdisks the names myvol.p0.s0, myvol.p0.s1, myvol.p1.s0 and
myvol.p1.s1 respectively.
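
The default naming scheme described in this list is easy to reproduce. The following
Python sketch generates plex and subdisk names from a volume name; the function name
default_names is hypothetical and the code is only an illustration of the suffix rules,
not of how vinum(8) is implemented.

```python
def default_names(volume, subdisks_per_plex):
    """Return (plex names, subdisk names) using the .pN and .sN suffix scheme.

    subdisks_per_plex lists, for each plex in order, how many subdisks it has.
    """
    plexes, subdisks = [], []
    for plex_number, count in enumerate(subdisks_per_plex):
        plex = f"{volume}.p{plex_number}"
        plexes.append(plex)
        subdisks.extend(f"{plex}.s{sd_number}" for sd_number in range(count))
    return plexes, subdisks
```

For the example file, default_names("myvol", [2, 2]) gives two plex names and four
subdisk names; together with the four drives and the volume itself, that accounts for
the 11 objects.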

After processing this file, vinum(8) produces the following output: 


vinum -> create config1
Configuration summary

Drives:         4 (4 configured)
Volumes:        1 (4 configured)
Plexes:         2 (8 configured)
Subdisks:       4 (16 configured)

D a                   State: up     Device /dev/sd0h
D b                   State: up     Device /dev/sd1h
D c                   State: up     Device /dev/sd2h
D d                   State: up     Device /dev/sd3h

V myvol               State: up     Plexes:       2   Size:   1024 MB

P myvol.p0          C State: up     Subdisks:     2   Size:   1024 MB
P myvol.p1          C State: up     Subdisks:     2   Size:   1024 MB

S myvol.p0.s0         State: up     PO:      0  B     Size:    512 MB
S myvol.p0.s1         State: up     PO:    512 MB     Size:    512 MB
S myvol.p1.s0         State: up     PO:      0  B     Size:    512 MB
S myvol.p1.s1         State: up     PO:    512 MB     Size:    512 MB



This output shows the brief listing format of vinum(8).

The following figure demonstrates the layout in graphic form: 



In this representation, each plex covers the complete address space of the volume.
Subdisks myvol.p0.s0 and myvol.p1.s0 cover the first half of the address space, and
subdisks myvol.p0.s1 and myvol.p1.s1 cover the second half of the address space.

Since Vinum stores these definitions in the configuration database, there is never any need
to define them again in another configuration file. At a later date the user might want to
create another volume and add a plex to the existing volume myvol. In order to do so, he
might need to add another drive to the Vinum configuration. He could perform these
tasks with the following configuration file:


plex volume myvol org striped 256k
  sd size 256m drive a
  sd size 256m drive b
  sd size 256m drive c
  sd size 256m drive d
drive e device /dev/sd4h
volume bigraid
  plex org raid5 256k
    sd size 2g drive a
    sd size 2g drive b
    sd size 2g drive c
    sd size 2g drive d
    sd size 2g drive e


In this example, the user first adds another plex to volume myvol. The system 
automatically assigns it the name myvol.p2. The plex contains four subdisks, and the data 
is striped across the subdisks in stripes of 256 kB. 

The new volume is called bigraid and contains one plex with a RAID-5 organization and
five subdisks. Again the stripe size is 256 kB. In order to accommodate the fifth subdisk,
the user defines a fifth drive, drive e, which is on the disk partition /dev/sd4h. He doesn’t
need to define the others, since they are already known to the system.

After running vinum(8), the configuration looks like: 


vinum -> create config2
Configuration summary

Drives:         5 (8 configured)
Volumes:        2 (4 configured)
Plexes:         4 (8 configured)
Subdisks:       13 (16 configured)

D a                   State: up     Device /dev/sd0h
D b                   State: up     Device /dev/sd1h
D c                   State: up     Device /dev/sd2h
D d                   State: up     Device /dev/sd3h
D e                   State: up     Device /dev/sd4h

V myvol               State: up     Plexes:       3   Size:   1024 MB
V bigraid             State: up     Plexes:       1   Size:      8 GB

P myvol.p0          C State: up     Subdisks:     2   Size:   1024 MB
P myvol.p1          C State: up     Subdisks:     2   Size:   1024 MB
P myvol.p2          S State: up     Subdisks:     4   Size:   1024 MB
P bigraid.p0       R5 State: init   Subdisks:     5   Size:      8 GB

S myvol.p0.s0         State: up     PO:      0  B     Size:    512 MB
S myvol.p0.s1         State: up     PO:    512 MB     Size:    512 MB
S myvol.p1.s0         State: up     PO:      0  B     Size:    512 MB
S myvol.p1.s1         State: up     PO:    512 MB     Size:    512 MB
S myvol.p2.s0         State: up     PO:      0  B     Size:    256 MB
S myvol.p2.s1         State: up     PO:    256 MB     Size:    256 MB
S myvol.p2.s2         State: up     PO:    512 MB     Size:    256 MB
S myvol.p2.s3         State: up     PO:    768 MB     Size:    256 MB
S bigraid.p0.s0       State: init   PO:      0  B     Size:   2048 MB
S bigraid.p0.s1       State: init   PO:   2048 MB     Size:   2048 MB
S bigraid.p0.s2       State: init   PO:   4096 MB     Size:   2048 MB
S bigraid.p0.s3       State: init   PO:   6144 MB     Size:   2048 MB
S bigraid.p0.s4       State: init   PO:   8192 MB     Size:   2048 MB


Startup 

The configuration information is stored on the disk slices in essentially the same form as
in the configuration files, which enables the same routines to be used for initialization.
When reading from the configuration database, vinum(8) recognizes a number of
keywords which are not allowed in the configuration files. At a point where the user has
commenced initialization of plex bigraid.p0, the configuration database might contain:

drive a state up device /dev/sd0h
drive b state up device /dev/sd1h
drive c state up device /dev/sd2h
drive d state up device /dev/sd3h
drive e state up device /dev/sd4h
volume myvol state up
volume bigraid state down
plex name myvol.p0 state up org concat vol myvol
plex name myvol.p1 state up org concat vol myvol
plex name myvol.p2 state init org striped 512b vol myvol
plex name bigraid.p0 state initializing org raid5 512b vol bigraid
sd name myvol.p0.s0 drive a plex myvol.p0 state up len 1048576b driveoffset 265b plexoffset 0b
sd name myvol.p0.s1 drive b plex myvol.p0 state up len 1048576b driveoffset 265b plexoffset 1048576b
sd name myvol.p1.s0 drive c plex myvol.p1 state up len 1048576b driveoffset 265b plexoffset 0b
sd name myvol.p1.s1 drive d plex myvol.p1 state up len 1048576b driveoffset 265b plexoffset 1048576b
sd name myvol.p2.s0 drive a plex myvol.p2 state init len 524288b driveoffset 1048841b plexoffset 0b
sd name myvol.p2.s1 drive b plex myvol.p2 state init len 524288b driveoffset 1048841b plexoffset 524288b
sd name myvol.p2.s2 drive c plex myvol.p2 state init len 524288b driveoffset 1048841b plexoffset 1048576b
sd name myvol.p2.s3 drive d plex myvol.p2 state init len 524288b driveoffset 1048841b plexoffset 1572864b
sd name bigraid.p0.s0 drive a plex bigraid.p0 state initializing len 4194304b driveoffset 1573129b plexoffset 0b
sd name bigraid.p0.s1 drive b plex bigraid.p0 state initializing len 4194304b driveoffset 1573129b plexoffset 4194304b
sd name bigraid.p0.s2 drive c plex bigraid.p0 state initializing len 4194304b driveoffset 1573129b plexoffset 8388608b
sd name bigraid.p0.s3 drive d plex bigraid.p0 state initializing len 4194304b driveoffset 1573129b plexoffset 12582912b
sd name bigraid.p0.s4 drive e plex bigraid.p0 state initializing len 4194304b driveoffset 1573129b plexoffset 16777216b


The obvious differences here are the presence of explicit location information and naming 
(both of which are also allowed, but discouraged, for use by the user) and the information 
on the states (which are not available to the user). 
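
The lengths in this database are consistent with the b suffix counting 512-byte sectors:
a 512m subdisk appears as len 1048576b, and the 256k stripe as 512b. The text does not
state this, so the sector size in the following Python sketch is an inference, and the
function names are invented for the illustration. The sketch also checks that five
RAID-5 subdisks of 2 GB each give the 8 GB reported for bigraid, on the understanding
that one subdisk's worth of space holds parity.

```python
SECTOR_BYTES = 512  # assumed meaning of the "b" suffix, inferred from the listings


def sectors_to_mb(sectors):
    """Convert a sector count to whole megabytes (1 MB = 1024 * 1024 bytes)."""
    return sectors * SECTOR_BYTES // (1024 * 1024)


def raid5_usable_sectors(subdisk_sectors, subdisk_count):
    """Usable space in a RAID-5 plex: one subdisk's worth of sectors holds parity."""
    return subdisk_sectors * (subdisk_count - 1)
```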

At system startup, Vinum reads the configuration database from one of the Vinum drives. 
Under normal circumstances, each drive contains an identical copy of the configuration 
database, so it does not matter which drive is read. After a crash, however, Vinum must 
determine which drive was updated most recently and read the configuration from this 
drive. 


Object naming 

As described above, Vinum assigns default names to plexes and subdisks, although they
may be overridden. Overriding the default names is not recommended: experience with
the VERITAS volume manager, which allows arbitrary naming of objects, has shown that
this flexibility does not bring a significant advantage, and it can cause confusion.

Names may contain any non-blank character, but it is recommended to restrict them to
letters, digits and underscore characters. The names of volumes, plexes and subdisks
may be up to 64 characters long, and the names of drives may be up to 32 characters long.

After reading the configuration database, vinum(8) creates a directory /dev/vinum, in
which it makes device entries for each volume it finds. It also creates subdirectories
/dev/vinum/vol, /dev/vinum/plex, /dev/vinum/sd and /dev/vinum/drive, in which it stores
device entries for the corresponding objects. The directories /dev/vinum/vol and
/dev/vinum/plex contain subdirectories representing the hierarchy of the plexes and
subdisks attached to them.

After processing the first of the configuration files described above, Vinum creates the 
following devices: 


/dev/vinum:
total 5
brwx------  1 root  wheel     25, 0x40000000 Jul 28 10:57
drwxr-xr-x  2 root  wheel            512 Jul 28 10:57 drive
drwxr-xr-x  2 root  wheel            512 Jul 28 10:57 plex
drwxr-xr-x  2 root  wheel            512 Jul 28 10:57 sd
drwxr-xr-x  3 root  wheel            512 Jul 28 10:57 vol

/dev/vinum/drive:
total 0
brw-r-----  1 root  operator   4,  39 May 25 12:32 a
brw-r-----  1 root  operator   4,  15 May 24 14:05 b
brw-r-----  1 root  operator   4,  23 May 24 14:05 c
brw-r-----  1 root  operator   4,  31 May 24 14:05 d

/dev/vinum/plex:
total 0
brwxr-xr--  1 root  wheel     25, 0x10000000 Jul 28 10:57 myvol.p0
brwxr-xr--  1 root  wheel     25, 0x10010000 Jul 28 10:57 myvol.p1

/dev/vinum/sd:
total 0
brwxr-xr--  1 root  wheel     25, 0x20000000 Jul 28 10:57 myvol.p0.s0
brwxr-xr--  1 root  wheel     25, 0x20100000 Jul 28 10:57 myvol.p0.s1
brwxr-xr--  1 root  wheel     25, 0x20010000 Jul 28 10:57 myvol.p1.s0
brwxr-xr--  1 root  wheel     25, 0x20110000 Jul 28 10:57 myvol.p1.s1

/dev/vinum/vol:
total 1
brwxr-xr--  1 root  wheel     25,          0 Jul 28 10:57 myvol
drwxr-xr-x  4 root  wheel            512 Jul 28 10:57 myvol.plex

/dev/vinum/vol/myvol.plex:
total 2
brwxr-xr--  1 root  wheel     25, 0x10000000 Jul 28 10:57 myvol.p0
drwxr-xr-x  2 root  wheel            512 Jul 28 10:57 myvol.p0.sd
brwxr-xr--  1 root  wheel     25, 0x10010000 Jul 28 10:57 myvol.p1
drwxr-xr-x  2 root  wheel            512 Jul 28 10:57 myvol.p1.sd

/dev/vinum/vol/myvol.plex/myvol.p0.sd:
total 0
brwxr-xr--  1 root  wheel     25, 0x20000000 Jul 28 10:57 myvol.p0.s0
brwxr-xr--  1 root  wheel     25, 0x20100000 Jul 28 10:57 myvol.p0.s1

/dev/vinum/vol/myvol.plex/myvol.p1.sd:
total 0
brwxr-xr--  1 root  wheel     25, 0x20010000 Jul 28 10:57 myvol.p1.s0
brwxr-xr--  1 root  wheel     25, 0x20110000 Jul 28 10:57 myvol.p1.s1


Unlike UNIX drives, Vinum volumes are not partitioned, and thus do not contain a
partition table. This has required modification to some disk utilities, notably newfs,
which previously tried to interpret the last letter of a Vinum volume name as a slice
identifier.

Although it is recommended that plexes and subdisks should not be allocated specific
names, Vinum drives must be named. This makes it possible to move a drive to a
different location and still recognize it automatically. Drive names may be up to 32
characters long.







The implementation 

At the time of writing, Vinum is still in a late implementation stage. Many aspects of the 
implementation are subject to change. This section examines some of the more 
interesting tradeoffs in the implementation. 

Where the driver fits 

To the operating system, Vinum looks like a block device, so it is normally accessed as
a block device. Instead of operating directly on the device, it creates new requests and
passes them to the real device drivers. Conceptually it could pass them to other Vinum
devices, though this usage makes no sense and would probably cause problems. The
following figure, borrowed from [McKusick], 2 shows the standard 4.4BSD I/O structure:


[Figure: layered block diagram. Labels, from top to bottom: system call interface to the
kernel; active file entries; socket, VNODE layer; network protocols, NFS, local naming
(UFS) with FFS and MFS, special devices, VM; network interface drivers, buffer cache,
cooked disk, raw disk and tty, swap space mgmt.; block device driver, character device
driver; the hardware.]

Figure 6: Kernel I/O structure, after McKusick et al.


The following figure shows the I/O structure in FreeBSD after installing Vinum. Apart 
from the effect of Vinum, it shows the gradual lack of distinction between block and 
character devices that has occurred since the release of 4.4BSD: FreeBSD implements 
disk character devices in the corresponding block driver. 


2. This figure is © 1996 Addison-Wesley, and is reproduced with permission. 











[Figure: layered block diagram. Labels, from top to bottom: system call interface to the
kernel; active file entries; socket, VNODE layer; network protocols, NFS, local naming
(UFS) with LFS, special devices, VM; network interface drivers, buffer cache, cooked
disk, tty line discipline, raw disk and tty, swap space mgmt.; Vinum block dev, Vinum
char; block device driver, character device driver; the hardware.]

Figure 7: Kernel I/O structure with Vinum


Design limitations 

Vinum was intended to have as few arbitrary limits as possible consistent with an efficient 
implementation. Nevertheless, a number of limits were imposed in the interests of 
efficiency, mainly in connection with the device minor number format. These limitations 
will no longer be relevant after the introduction of a device file system. 

Restriction
    Reasoning

Fixed maximum number of volumes per system.
    In order to maintain compatibility with other versions of UNIX, it was considered
    desirable to keep the device numbers of volumes in the traditional 8+8 format (8 bits
    major number, 8 bits minor number). This restricts the number of volumes to 256. In
    view of the fact that Vinum provides for larger volumes than disks, and current
    machines are seldom able to control more than 64 disk drives, this restriction seems
    unlikely to become severe for some years to come.

Fixed number of plexes per volume.
    Plexes supply redundancy according to RAID-1. For this purpose, two plexes are
    sufficient under normal circumstances. For rebuilding and archival purposes,
    additional plexes can be useful, but it is difficult to find a situation where more
    than four plexes are necessary or useful. On the other hand, additional plexes beyond
    four bring little advantage for reading and a significant disadvantage for writing.
    I believe that eight plexes are ample.

Fixed maximum number of subdisks per plex.
    For similar reasons, the number of subdisks was limited to 256. It seldom makes sense
    to have more than about 10 subdisks per plex, so this restriction does not currently
    appear severe. There is no specific overall limitation on the number of subdisks.

Minimum device size.
    A device must contain at least 1 MB of storage. This assumption makes it possible to
    dispense with some boundary condition checks. Vinum requires 133 kB of disk space to
    store the header and configuration information, so this restriction does not appear
    serious.

Memory allocation 

In order to perform its functions, Vinum allocates a large number of dynamic data
structures. Currently these structures are allocated by calling kernel malloc. This is a
potential problem, since malloc interacts with the virtual memory system and may
trigger a page fault. The potential for a deadlock exists if the page fault requires a
transfer to a Vinum volume. It is probable that Vinum will modify its allocation strategy
by reserving a small number of buffers when it starts and using these if a malloc
request fails.

To cache or not to cache 

Traditionally, UNIX block devices are accessed from the file system via caching routines 
such as bread and bwrite. It is also possible to access them directly, but this facility is 
seldom used. The use of caching enables significant improvements in performance. 

Vinum does not cache the data it passes to the lower-level drivers. It would also seem
counterproductive to do so: the data is available in cache already, and the only effect of
caching it a second time would be to use more memory, thus causing more frequent cache
misses.

RAID-5 plexes pose a problem to this reasoning. A RAID-5 write normally first reads 
the parity block, so there might be some advantage in caching at least the parity blocks. 
This issue has been deferred for further study. 

Access optimization 

The algorithms for RAID-5 access are surprisingly complicated and require a significant
amount of temporary data storage. To achieve reasonable performance, they must take
error recovery strategies into account at a low level. A RAID-5 access can require one or
more of the following actions:

• Normal read. All participating subdisks are up, and the transfer can be made directly 
to the user buffer. 

• Recovery read. One participating subdisk is down. To recover data, all the other 
subdisks, including the parity subdisk, must be read. The data is recovered by 
exclusive-oring all the other blocks. 

• Normal write. All the participating subdisks are up. This write proceeds in four 
phases: 

1. Read the old contents of each block and the parity block. 

2. “Remove” the old contents from the parity block with exclusive or. 

3. “Insert” the new contents of the block in the parity block, again with exclusive 
or. 

4. Write the new contents of the data blocks and the parity block. The data block 
transfers can be made directly from the user buffer. 

• Degraded write where the data block is not available. This requires the following 
steps: 

1. Read in all the other data blocks, excluding the parity block. 

2. Recreate the parity block from the other data blocks and the data to be written. 

3. Write the parity block. 

• Parityless write, a write where the parity block is not available. This is in fact the
simplest: just write the data blocks. This can proceed directly from the user buffer.
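
The parity arithmetic behind these actions is plain exclusive OR. The following Python
sketch shows the recovery read and the parity update of a normal write on byte strings;
it is a minimal illustration with invented function names, and it ignores the buffering,
request scheduling and error handling that the driver itself has to deal with.

```python
from functools import reduce


def xor_blocks(*blocks):
    """Bytewise exclusive OR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def normal_write_parity(old_data, old_parity, new_data):
    """New parity for a normal write of one data block (phases 2 and 3 of the list)."""
    without_old = xor_blocks(old_parity, old_data)   # "remove" the old contents
    return xor_blocks(without_old, new_data)         # "insert" the new contents


def recovery_read(other_blocks):
    """Rebuild the unavailable block from all other blocks, parity included."""
    return xor_blocks(*other_blocks)
```

The same xor_blocks helper covers the degraded write: the new parity is the exclusive OR
of the other data blocks and the data to be written.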





Combining access strategies 

In practice, a transfer request may combine the actions above. In particular: 

• A read request may request reading both available data (normal read) and non- 
available data (recovery read). This can be a problem if the address ranges of the two 
reads do not coincide: the normal read must be extended to cover the address range of 
the recovery read, and must thus be performed out of malloced memory. 

• Combination of degraded data block write and normal write. The address ranges of 
the reads may also need to be extended to cover all participating blocks. 

An exception exists when the transfer is shorter than the width of the stripe and is spread 
over two subdisks. In this case, the subdisk addresses do not overlap, so they are 
effectively two separate requests. 

Examples 

The following examples illustrate these concepts: 

Offset    Subdisk 1   Subdisk 2   Subdisk 3   Subdisk 4   Subdisk 5

0x0000    0x0000      0x1000      0x2000      0x3000      Parity
0x1000    0x4000      0x5000      0x6000      Parity      0x7000
0x2000    0x8000      0x9000      Parity      0xa000 *    0xb000 *
0x3000    0xc000      Parity      0xd000      0xe000      0xf000
0x4000    Parity      0x10000     0x11000     0x12000     0x13000

Parity    Parity block
*         Data block involved in transfer (shaded in the original figure)

Figure 8: A sample RAID-5 transfer

This diagram illustrates a number of typical points about RAID-5 transfers. It shows the
beginning of a plex with five subdisks and a stripe size of 4 kB. The shaded area shows
the area involved in a transfer of 4.5 kB (9 sectors), starting at offset 0xa800 in the plex.
A read of this area generates two requests to the lower-level driver: 4 sectors from
subdisk 4, starting at offset 0x2800, and 5 sectors from subdisk 5, starting at offset
0x2000.
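The address mapping in this example can be reproduced with a short Python sketch. It assumes the layout of Figure 8 (five subdisks, 4 kB stripes, 512-byte sectors, parity rotating from the last subdisk towards the first); the function names are illustrative and are not taken from the Vinum source.

```python
SECTOR = 512          # bytes per sector
STRIPE = 0x1000       # 4 kB stripe size, as in Figure 8
NSUBDISKS = 5         # subdisks in the plex


def map_offset(plex_offset):
    """Return (data subdisk, parity subdisk, subdisk offset); subdisks are 1-based."""
    block = plex_offset // STRIPE                  # index of the data block in the plex
    row, col = divmod(block, NSUBDISKS - 1)        # stripe number, position in stripe
    parity = NSUBDISKS - 1 - row % NSUBDISKS       # parity moves one subdisk left per stripe
    subdisk = col if col < parity else col + 1     # data blocks skip the parity subdisk
    return subdisk + 1, parity + 1, row * STRIPE + plex_offset % STRIPE


def split_request(plex_offset, length):
    """Split a transfer into (subdisk, subdisk offset, sectors) pieces."""
    pieces = []
    end = plex_offset + length
    while plex_offset < end:
        subdisk, _, sd_offset = map_offset(plex_offset)
        # A piece ends at the next stripe boundary or at the end of the request.
        chunk = min(end, (plex_offset // STRIPE + 1) * STRIPE) - plex_offset
        pieces.append((subdisk, sd_offset, chunk // SECTOR))
        plex_offset += chunk
    return pieces


for subdisk, sd_offset, sectors in split_request(0xA800, 9 * SECTOR):
    print(f"subdisk {subdisk}: {sectors} sectors at {sd_offset:#x}")
```

For the 4.5 kB transfer at 0xa800 this yields the two requests described above, and `map_offset(0xA800)` reports subdisk 3 as the parity subdisk for the stripe.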

Writing this area is significantly more complicated. From a programming standpoint, the 
simplest approach is to consider the transfers individually. This would create the 
following requests: 

• Read the old contents of 4 sectors from subdisk 4, starting at offset 0x2800. 

• Read the old contents of 4 sectors from subdisk 3 (the parity disk), starting at offset 
0x2800. 

• Perform an exclusive OR of the data read from subdisk 4 with the data read from 
subdisk 3, storing the result in subdisk 3’s data buffer. This effectively “removes” the 
old data from the parity block. 

• Perform an exclusive OR of the data to be written to subdisk 4 with the data read from 
subdisk 3, storing the result in subdisk 3’s data buffer. This effectively “adds” the 
new data to the parity block. 

• Write the new data to 4 sectors of subdisk 4, starting at offset 0x2800. 

• Write 4 sectors of new parity data to subdisk 3 (the parity disk), starting at offset
0x2800.

• Read the old contents of 5 sectors from subdisk 5, starting at offset 0x2000.

• Read the old contents of 5 sectors from subdisk 3 (the parity disk), starting at offset 
0x2000. 

• Perform an exclusive OR of the data read from subdisk 5 with the data read from 
subdisk 3, storing the result in subdisk 3’s data buffer. This effectively “removes” the 
old data from the parity block. 

• Perform an exclusive OR of the data to be written to subdisk 5 with the data read from 
subdisk 3, storing the result in subdisk 3’s data buffer. This effectively “adds” the 
new data to the parity block. 

• Write the new data to 5 sectors of subdisk 5, starting at offset 0x2000.

• Write 5 sectors of new parity data to subdisk 3 (the parity disk), starting at offset 
0x2000. 

This approach is clearly suboptimal. The operation involves a total of 8 I/O operations
and transfers 36 sectors of data. In addition, the two halves of the operation block each
other, since each must access the same data on the parity subdisk. Vinum optimizes this
access in the following manner:

• Read the old contents of 4 sectors from subdisk 4, starting at offset 0x2800. 

• Read the old contents of 5 sectors from subdisk 5, starting at offset 0x2000. 

• Read the old contents of 8 sectors from subdisk 3 (the parity disk), starting at offset
0x2000. This represents the complete parity block for the stripe.

• Perform an exclusive OR of the data read from subdisk 4 with the data read from 
subdisk 3, starting at offset 0x800 into the buffer, and storing the result in the same 
place in subdisk 3’s data buffer. 

• Perform an exclusive OR of the data read from subdisk 5 with the data read from
subdisk 3, starting at the beginning of the buffer, and storing the result in the same
place in subdisk 3’s data buffer.

• Perform an exclusive OR of the data to be written to subdisk 4 with the modified
parity block, starting at offset 0x800 into the buffer, and storing the result in the same
place in subdisk 3’s data buffer.

• Perform an exclusive OR of the data to be written to subdisk 5 with the modified
parity block, starting at the beginning of the buffer, and storing the result in the same
place in subdisk 3’s data buffer.

• Write the new data to 4 sectors of subdisk 4, starting at offset 0x2800. 

• Write the new data to 5 sectors of subdisk 5, starting at offset 0x2000.

• Write the 8 sectors of new parity data to subdisk 3 (the parity disk), starting at offset 
0x2000. 

This is still a lot of work, but by comparison with the non-optimized version, the number 
of I/O operations has been reduced to 6, and the number of sectors transferred is reduced 
by 2. The larger the overlap, the greater the saving. If the request had been for a total of 
17 sectors, starting at offset 0x9800, the unoptimized version would have performed 12 
I/O operations and moved a total of 68 sectors, while the optimized version would 
perform 8 I/O operations and move a total of 50 sectors. 
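The I/O counts quoted above can be checked with a small calculation. The sketch below is an illustration under one assumption: the optimized version reads and writes a single parity range spanning all the pieces in the stripe. The function name and the tuple format are made up for the example.

```python
def write_cost(pieces):
    """pieces: one (first sector within the stripe, sector count) per data subdisk.

    Returns ((I/Os, sectors) unoptimized, (I/Os, sectors) optimized)."""
    data = sum(count for _, count in pieces)
    # Unoptimized: each piece reads and writes its data and its own parity range.
    unoptimized = (4 * len(pieces), 4 * data)
    # Optimized: one parity read and one parity write spanning all the pieces.
    first = min(start for start, _ in pieces)
    last = max(start + count for start, count in pieces)
    parity = last - first
    optimized = (2 * (len(pieces) + 1), 2 * (data + parity))
    return unoptimized, optimized


# 9 sectors at 0xa800: 4 sectors on subdisk 4, 5 sectors on subdisk 5.
print(write_cost([(4, 4), (0, 5)]))            # ((8, 36), (6, 34))
# 17 sectors at 0x9800: 4, 8 and 5 sectors on three data subdisks.
print(write_cost([(4, 4), (0, 8), (0, 5)]))    # ((12, 68), (8, 50))
```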





Degraded read 

The following figure illustrates the situation where a data subdisk fails, in this case
subdisk 4.

[Figure: the plex layout of Figure 8 with subdisk 4 inaccessible. Legend: parity block;
data block involved in transfer; inaccessible data block involved in transfer;
inaccessible data.]

Figure 9: RAID-5 transfer with inaccessible data block

In this case, reading the data from subdisk 5 is trivial. Recreating the data from subdisk 
4, however, requires reading all the remaining subdisks. Specifically, 

• Read 4 sectors each from subdisks 1, 2 and 3, starting at offset 0x2800 in each case.

• Read 8 sectors from subdisk 5, starting at offset 0x2000.

• Clear the user buffer area for the data corresponding to subdisk 4. 

• Perform an “exclusive or” operation on this data buffer with data from subdisks 1, 2, 
3, and the last four sectors of the data from subdisk 5. 






• Transfer the first 5 sectors of data from the data buffer for subdisk 5 to the 
corresponding place in the user data buffer. 
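The recovery step can be sketched as follows. The block contents are invented test data and the helper name is an assumption; the point is only that XORing the surviving data blocks with the parity block into a cleared buffer returns the missing data.

```python
SECTOR = 512
LENGTH = 4 * SECTOR   # the four sectors at offset 0x2800 on each subdisk


def xor_into(target, source):
    """XOR source into the bytearray target, in place."""
    for i, byte in enumerate(source):
        target[i] ^= byte


# Invented contents for the data subdisks of the stripe (subdisk 3 holds parity).
data = {sd: bytes((sd * 37 + i) % 256 for i in range(LENGTH)) for sd in (1, 2, 4, 5)}
parity = bytearray(LENGTH)
for block in data.values():
    xor_into(parity, block)

# Subdisk 4 has failed: start from a cleared buffer and XOR in everything else.
recovered = bytearray(LENGTH)
for sd in (1, 2, 5):
    xor_into(recovered, data[sd])
xor_into(recovered, parity)

assert bytes(recovered) == data[4]
```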

Degraded write 

There are two different scenarios to be considered in a degraded write. Referring to the
previous example, the operations required are a mixture of normal write (for subdisk 5)
and degraded write (for subdisk 4). In detail, the operations are:

• Read 4 sectors each from subdisks 1 and 2, starting at offset 0x2800, into temporary 
storage. 

• Read 5 sectors from subdisk 3 (parity block), starting at offset 0x2000, into the
beginning of an 8 sector temporary storage buffer.

• Clear the last 3 sectors of the parity block. 

• Read 8 sectors from subdisk 5, starting at offset 0x2000, into temporary storage. 

• “Remove” the first 5 sectors of subdisk 5 data from the parity block with exclusive or. 

• Rebuild the last 3 sectors of the parity block by exclusive or of the corresponding data 
from subdisks 1, 2, 5 and the data to be written for subdisk 4. 

• Write the parity block back to subdisk 3 (8 sectors). 

• Write 5 sectors user data to subdisk 5. 


Parityless write 

Another situation arises when the subdisk containing the parity block fails: 


[Figure: a five-subdisk RAID-5 plex in which the subdisk holding the parity block is
inaccessible. Legend: parity block; data block involved in transfer; inaccessible data.]

Figure 10: RAID-5 transfer with inaccessible parity block

This configuration poses no problems on reading, since all the data is accessible. On 
writing, however, it is not possible to write the parity block. It is not possible to recover 
from this problem at the time of the write, so the write operation simplifies to writing 
only the data blocks. The parity block will be recreated when the subdisk is brought up 
again. 

Driver structure 

One important detail of the nature of the operations which must be performed for RAID-5 
access is that they frequently must be performed in two steps. This does not match well 
with the design of UNIX device drivers: typically, the “top half” 3 of a UNIX device 
driver issues I/O commands and returns to the caller. The caller may choose to wait for 
completion, but one of the most frequent uses of a block device is where the virtual 
memory subsystem issues writes and does not wait for completion. 

3. UNIX device drivers run in two separate environments. The “top half” runs in the process context, 
while the “bottom half” runs in the interrupt context. There are severe restrictions on the functions 
that the bottom half of the driver can perform. 









This poses a problem: who issues the second set of requests? The following possibilities, 
listed in order of increasing desirability, exist: 

1. The top half can wait for completion of the first set of requests and then launch the 
second set before returning to the caller. This approach can seriously impact 
system performance and possibly cause deadlocks. 

2. In a threaded kernel, the strategy routine can create a thread which waits for 
completion of the first set of requests and starts the second set without impacting 
the main thread of the process. At the moment this approach is not possible, since 
FreeBSD currently does not provide kernel thread support. It also appears likely 
that it could cause a number of problems in the areas of thread synchronization and 
performance. 

3. Ownership of the requests can be “given” to another process, which will be 
awakened when they complete. This process can then issue the second set of 
requests. This approach is feasible, and it is used by some subsystems, notably 
NFS. It does not pose the same severe performance penalty of the previous 
possibility, but it does require that another process be scheduled twice for every 
I/O. 

4. The second set of requests can be launched from the “bottom half” of the driver. 
This is potentially dangerous: the interrupt routine must call the start routine. 
While this is not expressly prohibited, the start routine is normally used by the 
top half of a driver, and may call functions which are prohibited in the bottom half. 

Currently, Vinum uses the fourth solution. This works for most drivers, but not for the 
Adaptec 154x driver on a system with more than 16 MB memory: since the 154x is an 
ISA bus master device, the driver must allocate bounce buffers on machines with more 
than 16 MB memory. The driver allocates these buffers by calling malloc, which calls 
tsleep if memory is not available. As a result, Vinum cannot be used on a system with
an Adaptec 154x and more than 16 MB of memory.

It is possible that this deficiency, possibly with others like it, will lead to a change in the 
driver structure; given the current alternatives, this would mean a daemon process to 
handle the I/O. 
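The two-step structure of the fourth approach can be modelled outside the kernel. The Python sketch below is a simplified simulation, not the Vinum driver: the class names, the fake driver and its interrupt method are all hypothetical. It shows the second set of requests being started from the completion callback of the first set, which is the point where the real driver runs in interrupt context.

```python
class TwoPhaseRequest:
    """A request whose second set of I/O starts when the first set completes."""

    def __init__(self, driver, reads, writes):
        self.driver = driver
        self.writes = writes
        self.pending = len(reads)            # first-phase requests still outstanding
        for subdisk in reads:
            driver.start(("read", subdisk), self.read_done)

    def read_done(self, request):
        """Completion callback; in a real driver this runs in the bottom half."""
        self.pending -= 1
        if self.pending == 0:                # last read finished: launch phase two
            for subdisk in self.writes:
                self.driver.start(("write", subdisk), lambda request: None)


class FakeDriver:
    """Stand-in for a lower-level disk driver; completes requests on demand."""

    def __init__(self):
        self.queue = []
        self.log = []

    def start(self, request, done):
        self.queue.append((request, done))

    def interrupt(self):
        request, done = self.queue.pop(0)
        self.log.append(request)
        done(request)                        # callback runs from the "interrupt"


driver = FakeDriver()
TwoPhaseRequest(driver, reads=[4, 5, 3], writes=[4, 5, 3])
while driver.queue:
    driver.interrupt()
print(driver.log)
```

In the simulation the call to `start` from within `interrupt` is harmless. In the real driver it is exactly the call that fails if the start routine may sleep, as in the Adaptec 154x case.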





Performance issues 


At present no detailed performance measurements have been made, but indications are
that the performance is very close to what could be expected from the underlying disk
driver performing the same operations as Vinum performs: in other words, the overhead
of Vinum itself is negligible. This does not mean that Vinum has perfect performance:
the choice of requests has a strong impact on the overall subsystem performance, and
there are some known areas which could be improved upon. In addition, the user can
influence performance by the design of the volumes.

The following sections examine some factors which influence performance. 

The influence of stripe size 

In striped and RAID-5 plexes, the stripe size has a significant influence on performance. 
In all plex structures except a single-subdisk plex (which by definition is concatenated), 
the possibility exists that a single transfer to or from a volume will be remapped into 
more than one physical I/O request. This is never desirable in a system without spindle 
synchronization, since the average latency for multiple transfers is always larger than the 
average latency for single transfers to the same kind of disk hardware. Within the bounds 
of the current BSD I/O architecture (maximum transfer size 128 kB), this increase in
latency can easily offset any speed increase in the transfer. This is the main reason why 
Vinum does not implement RAID-2 and RAID-3, which always transfer to all drives. 

In the case of a concatenated plex, this remapping occurs only when a request overlaps a 
subdisk boundary. In a striped or RAID-5 plex, however, the probability is an inverse 
function of the stripe size. For this reason, a stripe size of 256 kB appears to be optimum: 
it is small enough to create a relatively random mapping of file system hot spots to 
individual disks, and large enough to ensure that 95% of all transfers involve only a
single data subdisk. Preliminary testing has confirmed this recommendation. 
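The inverse relationship between stripe size and split transfers can be illustrated with a simple model. The sketch assumes that transfers start at uniformly distributed sector boundaries; the 8 kB transfer size is an arbitrary example value, not a figure from the measurements. Under that assumption a transfer of t sectors crosses a stripe boundary for t - 1 of the S possible starting sectors in a stripe of S sectors.

```python
SECTOR = 512


def crossing_probability(transfer_bytes, stripe_bytes):
    """Fraction of sector-aligned start positions for which a transfer
    crosses a stripe boundary, assuming uniformly distributed starts."""
    transfer = transfer_bytes // SECTOR
    stripe = stripe_bytes // SECTOR
    return min(1.0, (transfer - 1) / stripe)


for stripe_kb in (4, 32, 256):
    p = crossing_probability(8 * 1024, stripe_kb * 1024)
    print(f"{stripe_kb:3d} kB stripe: {p:.1%} of 8 kB transfers are split")
```

Real file system traffic is not uniformly aligned, so the model indicates the trend rather than the measured 95% figure.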

The influence of request structure 

For concatenated and striped plexes, Vinum creates request structures which map directly
to the user-level request buffers. The only additional overhead is the allocation of the 
request structure, and the possibility of improvement is correspondingly small. 

With RAID-5 plexes, the picture is very different. The strategic choices described above 
work well when the total request size is less than the stripe width. By contrast, consider 
the following transfer of 32.5 kB, starting from the same offset as the previous examples: 


[Figure: the plex layout of Figure 8, with the data blocks involved in the 32.5 kB transfer
shaded. Legend: parity block; data block involved in transfer.]


An optimum approach to reading this data performs a total of 5 I/O operations, one on 
each subdisk. By contrast, Vinum treats this transfer as three separate transfers, one per 
stripe, and thus performs a total of 9 I/O transfers. 

In practice, this inefficiency should not cause any problems: as discussed above, the 
optimum stripe size is larger than the maximum transfer size, so this situation does not 
arise when an appropriate stripe size is chosen. 
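The counts of 9 and 5 for the 32.5 kB read can be derived from the layout. The sketch below assumes the five-subdisk, 4 kB stripe layout of Figure 8 and uses illustrative function names. It counts one request per data block touched in each stripe, and separately the number of distinct subdisks touched.

```python
STRIPE = 0x1000       # 4 kB stripe size
NSUBDISKS = 5


def data_subdisk(plex_offset):
    """Return (stripe number, 1-based data subdisk) for a plex offset."""
    block = plex_offset // STRIPE
    row, col = divmod(block, NSUBDISKS - 1)
    parity = NSUBDISKS - 1 - row % NSUBDISKS
    return row, (col if col < parity else col + 1) + 1


def read_requests(plex_offset, length):
    """Return (requests when each stripe is handled separately,
    number of distinct subdisks touched)."""
    touched = set()
    first_block = plex_offset - plex_offset % STRIPE
    for offset in range(first_block, plex_offset + length, STRIPE):
        touched.add(data_subdisk(offset))
    return len(touched), len({subdisk for _, subdisk in touched})


# 32.5 kB (0x8200 bytes) starting at offset 0xa800.
print(read_requests(0xA800, 0x8200))   # (9, 5)
```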


Availability 

Vinum is currently under development. An alpha version of the base version (without 
RAID-5 functionality), running on the FreeBSD operating system, is available under a 
Berkeley-style copyright at [vinum]. The RAID-5 functionality is available under licence 
from Cybernet, Inc. [Cybernet], and is included in their NetMAX Internet connection 
package. 







Future directions 


The current version of Vinum implements the core functionality. A number of additional
features are under consideration:

• Hot spare capability: on the failure of a disk drive, the volume manager automatically 
recovers the data to another drive. 

• Logging changes to a degraded volume. Rebuilding a plex usually requires copying 
the entire volume. In a volume with a high read-to-write ratio, if a disk goes down
temporarily and then becomes accessible again (for example, as the result of controller 
failure), most of the data is already present and does not need to be copied. Logging 
pinpoints which blocks require copying in order to bring the stale plex up to date. 

• Snapshots of a volume. It is often useful to freeze the state of a volume, for example 
for backup purposes. A backup of a large volume can take several hours. It can be 
inconvenient or impossible to prohibit updates during this time. A snapshot solves 
this problem by maintaining before images, a copy of the old contents of the modified 
data blocks. Access to the plex reads the blocks from the snapshot plex if it contains 
the data, and from another plex if it does not. 

Implementing snapshots in Vinum alone would solve only part of the problem: there 
must also be a way to ensure that the data on the file system is consistent from a user 
standpoint when the snapshot is taken. This task involves such components as file 
systems and databases and is thus outside the scope of Vinum. 

• A SNMP interface for central management of Vinum systems. 

• A GUI interface is currently not planned, though it is relatively simple to program, 
since no kernel code is needed. As the number of failures testifies, a good GUI
interface is apparently very difficult to write, and it tends to gloss over important 
administrative aspects, so it’s not clear that the advantages justify the effort. On the 
other hand, a graphical output of the configuration could be of advantage. 

• An extensible UFS. It is possible to extend the size of some modern file systems after
they have been created. Although UFS (the UNIX File System, previously called the
Berkeley Fast File System) was not designed for such extension, it is trivial to
implement extensibility. This feature would allow a user to add space to a file system
which is approaching capacity by first adding subdisks to the plexes and then
extending the file system.





• Remote data replication is of interest either for backup purposes or for read-only 
access at a remote site. From a conceptual viewpoint, it could be achieved by 
interfacing to a network driver instead of a local disk driver. 

• Extending striped and RAID-5 plexes is a slow, complicated operation, but it is
feasible.


References 

[CMD] CMD Technology, Inc., June 1993, The Need For RAID, An Introduction.
http://www.fdma.com/info/raidinto.html

[Cybernet] The NetMAX Station, http://www.cybernet.com/netmax/index.html. The first
product using the Vinum Volume Manager.

[FreeBSD] FreeBSD home page, http://www.FreeBSD.org/

[IBM] AIX Version 4.3 System Management Guide: Operating System and Devices,
Logical Volume Storage Overview,
http://www.austin.ibm.com/doc_link/en_US/a_doc_lib/aixbman/baseadmn/lvm_overview.htm

[Linux] RAID Solutions for Linux, http://linas.org/linux/raid.html

[McKusick] Marshall Kirk McKusick, Keith Bostic, Michael J. Karels, John S.
Quarterman. The Design and Implementation of the 4.4BSD Operating System, Addison
Wesley, 1996.

[OpenSource] The Open Source Page, http://www.opensource.org/

[SCO] SCO Virtual Disk Manager, http://www.sco.com/products/layered/ras/virtual.html.

[Solstice] http://www.sun.com/solstice/em-products/system/disksuite.html

[vinum] Greg Lehey, The Vinum Volume Manager, http://www.lemis.com/vinum.html

[Wong] Brian Wong, RAID: What does it mean to me?, SunWorld Online, September
1995. http://www.sunworld.com/sunworldonline/swol-09-1995/swol-09-raid5.html





Scripts to Manage Change 

Julie Jester & Stephanie Nicholls 


Change? What sort of Change? 


In a large organisation like the Westpac Banking Corporation, managing change is critical to 
the smooth day-to-day running of the organisation. In Sydney, alone, there are many 
application teams spread over several large office buildings throughout the CBD and at St 
Leonards, on the North Shore. These teams are responsible for all of the software used within 
the bank: from ATMs to the Westpac Web Site; from the software used to run the bank 
branches to that used to produce the fortnightly payroll; minor applications; major 
applications; software supplied by third party vendors or software developed in-house. 

This vast array of software runs on mainframes, mid-range proprietary systems including 
Tandem and Digital, NT networks and, of course, Unix. And it’s the Unix side of the world 
that this paper addresses. Unix is fairly new to Westpac: it has only been used over the past
few years and it’s the wild-card in the previously ordered world of change management.
Westpac already had change management packages in place on the mainframes and
mid-range boxes, but these did not run on the Unix machines. Another solution was needed.


OK, so what is Change Management? 

Exactly what it says - methodologies to manage change. It is typically broken up into four 
areas: 

1. Change Control 

This is the organisational information flow that says who is doing what and when they are 
doing it! Included in the change control process are the review and approval processes for
changes to occur.

2. Source code control. 

This covers keeping track of the various versions of the code, being able to replicate the 
code as at a certain stage within its change cycle, keeping track of which version went with 
which release of the software and being able to back out of any changes at any time. 

3. Managing the work flow process. 

Not only do you want to be able to keep control of the source code but you want to 
organise the way in which the code is developed and tested. This means the progression of 
the code through a logical development lifecycle. The typical workflow process may be a 
four stage process: development, integration testing, user acceptance testing and 
production; or it may be a three stage process where either the development and integration 
testing stages are combined, or the integration testing and user acceptance testing stages 
are combined. There also needs to be provision for a two stage process, from development 
directly into production - this, however, is only permitted for emergency fixes and, in
Westpac, must be ratified by following the normal process within a given time-frame
(typically 11 working days).

4. Distribution. 

This may be as simple as moving the programs from the development areas to the 
production area on the same machine, or it may involve moving it to one or more different 
machines. It also needs to be controlled so that the application teams can’t just drop in a 
new version over the top of the old version willy-nilly. 

Within Westpac the source code control, the workflow and the distribution processes are known
as “lodgement” and are managed and automated as much as possible using a number of 
platform specific tools (Endevor for MVS, RMS on Tandem and See/Change for AS/400). 
These tools and the code they contain are the responsibility of Lodgement System Services, 
part of the Host and Mid-Range Operations Group of Westpac Technology. Change control 
is handled by a monolithic mainframe application “Info/Man” and the approval process 
within change is closely tied to the lodgement mechanism. 

These areas are often linked together under the title Software Configuration Management 
with the distribution process sometimes contained within the SCM process and at other times 
being handled separately. 


Methods of Controlling Change on Unix. 

The Unix Lodgement Team came into being in 1996 when the Financial Management 
Systems (FMS) application team started implementing an Oracle package on an HP-UX 
based server. They needed a method of migrating these applications from the development 
environment into the production environment in a controlled fashion. Ideally they wanted all 
of the functionality that was currently available on the mainframes. Westpac’s intention was 
“to purchase and implement a leading, full functional package to perform version control, 
configuration management, and lodgement functions”. 

There was just one problem with this idea - evaluating, selecting and implementing such a 
package takes a considerable amount of time, and this did not fit into the timeframe required 
for the implementation of the FMS system. An alternative solution was required. 

The change control side of things could still be managed using the existing mainframe 
software, even though this did not directly link into the Unix boxes. 

Much of the application was a third party vendor product, provided as a complete package, 
and any Westpac developed source code could be versioned under RCS or SCCS. 

What was needed was an interim lodgement system. 


Version 1 of the Interim Lodgement System. 

HP came to the rescue and designed a series of scripts based on the mainframe lodgement
processes. The scripts would be run by Westpac’s Batch and Recovery Team (BART)
operators, using a special lodgement functional user logon. The process used three major
areas within the application’s environment.


[Figure: the three areas PARK, LODGE and PROD]


The PARK area is where the application team places code to be migrated into production. A 
decision was made that the full application must be lodged every time, regardless of whether 
one program had been changed, or many. The reason was that this made the backout of a
change much easier for Lodgement Systems Services to manage and guarantee. This area
can be written to by the application team and must be readable by the lodgement functional 
user. Both source and executables must be placed in this area. 

The LODGE area acts as a repository for the new production version, and a selected number 
of backups of previous production versions. The application, both source and executables, is 
migrated across from the PARK area and secured so that only the lodgement functional user 
has write access, and the application team only has read access. 

The PROD area is where the application actually runs from. It may be part of a combined 
development/production machine or it may be on a separate production machine entirely. 
Only executables are migrated across from the LODGE area and, again, secured so that only 
the lodgement functional user has write access. To give an added measure of security and 
ease of application change the PROD area consists of two separate sections, exec1 and exec2.
One holds the version that is currently in production and the other holds the previous version. 
A symbolic link is used to point the application exec directory to the correct exec directory in 
the PROD area. During a new lodgement the oldest version is overwritten, and then control is 
switched to the new version by removing and recreating the link. If, for any reason, the new 
version does not work correctly the previous version can be switched back in again in a 
matter of seconds. 
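The link switch can be sketched as follows. The lodgement scripts themselves are shell scripts and are not reproduced in this paper, so this Python fragment is only an illustration: the function name is hypothetical, the link name `exec` is assumed, and a temporary directory stands in for the PROD area.

```python
import os
import tempfile


def switch_exec(prod_dir, version):
    """Point prod_dir/exec at prod_dir/<version> by removing and recreating the link."""
    link = os.path.join(prod_dir, "exec")
    if os.path.islink(link):
        os.remove(link)                  # remove the old link ...
    os.symlink(version, link)            # ... and recreate it with a relative target


prod = tempfile.mkdtemp()                # stand-in for the PROD area
for name in ("exec1", "exec2"):
    os.mkdir(os.path.join(prod, name))

switch_exec(prod, "exec1")               # current production version
switch_exec(prod, "exec2")               # new lodgement goes live
switch_exec(prod, "exec1")               # back out to the previous version
print(os.readlink(os.path.join(prod, "exec")))
```

Because only the link changes, neither version is copied or modified during the switch, which is why a backout takes seconds.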








Fixes. 

The ability to handle fix lodgements also had to be catered for. Within Westpac a ‘fix’ is one 
of those middle of the night emergencies when the “damn thing just won’t work as it should!” 
and hence the emphasis is on both speed of change and control. 

A section of the LODGE area was set up so that the source and executables for individual 
programs could be placed there by the application team. This was transferred to PROD, again 
into a separate directory. The application PATH had the fix area before the main application 
area so that any fixes would be picked up first. Since fixes are temporary under Westpac’s 
rules a script was set up, running under cron, to monitor how long the fix had been in 
production and, if a full lodgement had not taken place within eleven days after the fix 
lodgement, the system would be disabled by switching the link to an empty directory. 
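The test made by the monitoring script can be expressed as a small function. This is a sketch under stated assumptions, not the script that ran under cron: the function name and its arguments are hypothetical, and it counts calendar days rather than working days.

```python
from datetime import datetime, timedelta

FIX_LIMIT = timedelta(days=11)           # eleven days, counted as calendar days here


def fix_expired(fix_lodged, last_full_lodgement, now):
    """True if a fix has been in production longer than the limit
    without a full lodgement having taken place since it was lodged."""
    ratified = last_full_lodgement is not None and last_full_lodgement > fix_lodged
    return not ratified and now - fix_lodged > FIX_LIMIT


fix = datetime(1998, 3, 2, 2, 0)         # an invented middle-of-the-night fix
print(fix_expired(fix, None, fix + timedelta(days=5)))    # still within the limit
print(fix_expired(fix, None, fix + timedelta(days=12)))   # overdue: disable the system
```

When the function returns True, the real script would switch the link to an empty directory, as described above.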

It ain’t pretty but it works! 

So FMS had a lodgement system. It was a bit of a “black box” solution as there was no 
control over what happened prior to the code being placed into the PARK area but it provided 
a distribution solution that satisfied Westpac’s short-term requirements in a simple manner. 

Soon there were other applications teams using Unix. And they each needed a lodgement 
system. The original FMS system was copied and hacked around to meet the requirements for 
each new team - a full set of custom scripts for each team. Since this was only an interim 
solution it didn’t really matter if they weren’t pretty, as long as they did the job. 

Meanwhile a Unix Lodgement Team, consisting of two full time staff and two contractors, 
was busy evaluating Software Configuration Management packages to replace the interim 
system. After several months they finally selected a package, put together a proposal and ... it 
was rejected. 

The Unix Lodgement team was in the doldrums. One of the full-time staff members had 
already left, the other was looking to move on, and the contractors were also moving on. 

That’s when Steph and Julie came on the scene, in early 1997. 


Problems. 


By now the interim lodgement system was being used by six applications. 

Three of them were on a single box, two had separate development and production machines, 
and the last had one development machine and two production machines. 

They varied in size from 5 MB of executable code to more than 60 MB.

We had HP-UX, AIX and there were mutterings of new applications on Sun platforms.

One ran from the production machine and used ftp get to transfer the files, another ran from 
the development machine and used ftp put to transfer the files, another used rcp, and the
single machine ones used cpio from PARK to LODGE, and cp from LODGE to PROD, but at 
least these three were standard. 





One of the lodgement systems, handling the largest, two-production machine application, was 
not particularly robust and tended to collapse, usually in the middle of a weekend. 

A request came in for a new application, the seventh. This should have been easy, as it was 
identical to the three single machine systems already in place. Theoretically all the system 
variables were defined in one place, and used by all the scripts, but there were many
hard-coded exceptions to this rule scattered throughout all the scripts.

In fact, it was a classic case of a quick and dirty solution hanging around to haunt someone. 





To add to the chaos the Unix Security team, responsible for the security of all the Unix
platforms within Westpac, were implementing some new strategies which impacted the
lodgement systems. rcp was banned, and in some cases ftp get was not permitted, only ftp
put, especially when pushing the code through firewalls.

And more and more application teams were enquiring about lodgement systems, some with 
quite complicated requirements. 

There was a need to handle incremental lodgements with the appropriate backup and recovery 
paths. 

The scripts are currently run manually by a BART operator. In future it will be required that 
the scripts run automatically under the control of a scheduling tool, Control-M, to minimise 
the amount of manual work required for each lodgement. 

It was time for a rethink. 

The Interim Unix Lodgement System desperately needed both its architectural and technical
aspects reviewed to provide improved security, stability and maintainability.


The Revision - alias Version 2. 

The requirements of the revised lodgement system are 

• To be easy to maintain when adding a new application. 

• To be easy to maintain for architectural changes to existing applications. 

• To be secure, complying with Westpac’s current security policies. 

• Not to rely on proprietary products. 

• To be consistent where appropriate across all Unix applications.

We are also re-evaluating Software Configuration Management packages and it is intended 
that the revised lodgement system would be used as the distribution tool initially rather than 
trying to implement a new distribution package as well as the selected SCM package. 

The Architecture. 

Essentially there are two architectural models. 

Single Machine Model 






Multi Machine Model 



[Figure: multi machine model (diagram label: tar)]


This model utilizes two transfer areas which are used by the ftp process. With this 
model there may be one or many production machines, and the number may be changed 
very simply and easily. If there is more than one production machine there is provision 
for machine specific code to be transferred to each appropriate machine along with the 
common code. 








The Technical Design. 


The strategy behind the design of the revised scripts was to keep things as simple as possible. 
We also wanted a set of common scripts that could be installed on to each of the application 
platforms without any changes being required. A separate group of scripts would handle 
application specific variables and any customization required. Some of the scripts would be 
optional, some mandatory - such as the script that holds all the variable definitions. 

The original scripts used variable definitions to the point of ridiculousness, especially those 
relating to path or file names. When we had to maintain these scripts we spent more time 
trying to work out the value of a variable than we spent actually fixing the problem. We 
reduced these down to just those variables required for application-specific items, plus a few 
constants that were used in almost all of the scripts. 

The final major requirement was that the scripts be well laid out, structured, and easy to read. 
If Steph, who is not a programmer, could read them and understand them, they passed! 

Moving from PARK to LODGE 

This process is the same for both the single machine model and the multi machine model. 

The first step was to provide a series of checks on the PARK directory as many of the 
lodgement failures in the past had been caused by this area being incorrectly set up by the 
application team. These checks would be able to be run by the application team and would 
provide warnings of any errors. They would also run as the initial process in the transfer of 
the code from PARK to LODGE, and abort the lodgement if any errors were found, just in 
case the application team forgot to run the checks themselves. These checks included 
ensuring that all files and directories in the PARK area were readable by the lodgement 
functional user, that the version control file was the same for both source and exec 
directories, that a compile.log was present and had some data in it, and that the PARK area 
did not exceed a nominated size. 
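
The checks listed above can be sketched as a single shell function. This is only an illustration, not the Westpac script: the function name check_park, the src and exec directory names, the version.ctl file name and the kilobyte limit are all assumptions made for the example.

```shell
#!/bin/sh
# Sketch of the PARK checks. Every name used here (check_park, src, exec,
# version.ctl, the size limit in KB) is hypothetical.
check_park() {
    park=$1      # PARK directory to check
    max_kb=$2    # nominated maximum size in kilobytes
    errors=0

    # Every file and directory must be readable by the user running the check.
    unreadable=$(find "$park" -print | while IFS= read -r f; do
        [ -r "$f" ] || printf '%s\n' "$f"
    done)
    if [ -n "$unreadable" ]; then
        echo "ERROR: not readable: $unreadable"
        errors=$((errors + 1))
    fi

    # The version control file must be identical in source and exec.
    if ! cmp -s "$park/src/version.ctl" "$park/exec/version.ctl" 2>/dev/null; then
        echo "ERROR: version control files differ or are missing"
        errors=$((errors + 1))
    fi

    # compile.log must exist and contain some data.
    if [ ! -s "$park/compile.log" ]; then
        echo "ERROR: compile.log is missing or empty"
        errors=$((errors + 1))
    fi

    # The PARK area must not exceed the nominated size.
    size_kb=$(du -sk "$park" | awk '{print $1}')
    if [ "$size_kb" -gt "$max_kb" ]; then
        echo "ERROR: PARK area is ${size_kb}KB, limit is ${max_kb}KB"
        errors=$((errors + 1))
    fi

    # Succeed only if no check failed, so a caller can abort the lodgement.
    [ "$errors" -eq 0 ]
}
```

An application team could run the function on its own to see the warnings, while the transfer script could call it first and stop if it returns a non-zero status.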

The next step was to actually move the code from PARK to LODGE and secure it. The main 
change in this area was in how the version backups were stored. To save space and time we 
switched from cpio backups to compressed tar backups. Another major change was in how 
we named the backups. Originally there were three files: backup1, backup2 and backup3. 
Each time a lodgement was transferred into the LODGE area it would roll down the backups, 
removing backup3, renaming backup2 to backup3, renaming backup1 to backup2, and 
copying the current version into backup1 before overwriting it with the new version. This 
made it very easy to lose track of which backup belonged to which version, and if the 
same version was lodged more than once (which did happen occasionally) then the backup 
versions would not be correct. A simple change to use the version number as part of the 
backup name, and to remove the oldest backup using a time rule, got around that problem. 
Also if the same version is lodged more than once the system recognizes it is a relodgement 
and doesn’t try to do another backup. 
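
A minimal sketch of the version-named backup follows. It is not the production script: the function name backup_version, the backup_<version>.tar.gz naming, the use of gzip as the compressor and the 90-day retention period are assumptions chosen for the example.

```shell
#!/bin/sh
# Sketch of version-named backups. backup_version, the file naming, gzip and
# the default retention period are hypothetical. Both directories are
# expected to be absolute paths.
backup_version() {
    lodge=$1             # LODGE directory holding the current version
    backups=$2           # directory where backups are kept
    current=$3           # version number currently in LODGE
    new=$4               # version number about to be lodged
    keep_days=${5:-90}   # time rule: remove backups older than this

    # Lodging the same version again is a relodgement: no second backup.
    if [ "$new" = "$current" ]; then
        echo "relodgement of $new: no backup taken"
        return 0
    fi

    # Name the backup after the version it contains.
    tarfile="$backups/backup_$current.tar"
    (cd "$lodge" && tar cf "$tarfile" .) || { rm -f "$tarfile"; return 1; }
    gzip -f "$tarfile" || return 1

    # Remove old backups by age rather than by rolling file names.
    find "$backups" -name 'backup_*.tar.gz' -mtime +"$keep_days" \
        -exec rm -f {} \;
    echo "backed up $current"
}
```

Because the version number is part of the file name, it is always clear which backup belongs to which version.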

Moving from LODGE to PROD 





This stage required one separate script for the single machine model, and two different scripts 
for the multi machine model. 

The single machine model was simple, just a change to use cpio to move the code across, 
instead of cp, plus the overall tidy-up of course. 

The multi machine model required two scripts. One runs on the development machine 
to bundle up the executables, and any machine specific configuration files, and transfer them 
to each of the production machines using ftp put. The other runs on the production machine: 
it unpacks the compressed tar files and places them in the appropriate exec directory, the one 
that is not currently used for the live production code. 

Originally most of the applications had permanent .netrc files on their machines. This was a 
nuisance as whenever the password was changed on a machine we would have to go and 
change the .netrc file. Initially we tried building the .netrc file as we were preparing the 
transfer. This worked fine until Unix Security clamped down on .netrc files altogether and 
we had to discover the wonderful power of ftp macros. Now it is all done within the script. 
The BART operators enter the passwords in response to prompts from the scripts and then 
re-enter them to confirm, otherwise we get too many ftp failures due to typos. 

We build a compressed tar file for all of the common executables and place it into a transfer 
area, then we build another compressed tar file for the machine specific files for each 
production machine, if required. Each production machine receives the common tar file plus 
the appropriate custom tar file. The number of production machines, their names and IP 
addresses or hostnames are specified in the variable definition script so this process just 
becomes a simple loop depending on the value of those variables. 
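
The loop described above might look something like the following sketch. The variable name PROD_HOSTS, the host names, the function name list_transfers and the tar file names are invented for the example; the real transfer by ftp is left out, and the function only lists what each machine should receive.

```shell
#!/bin/sh
# Hypothetical setting that would live in the variable definition script.
PROD_HOSTS="prodA prodB prodC"

# Print one "host file" line for each tar file a production machine should
# receive. list_transfers and the file names are illustrative only.
list_transfers() {
    staging=$1    # directory holding common/ and optional per-host directories
    for host in $PROD_HOSTS; do
        # Every production machine receives the common tar file.
        echo "$host common.tar.gz"
        # A custom tar file is only needed when the host has its own files.
        if [ -d "$staging/$host" ]; then
            echo "$host $host.tar.gz"
        fi
    done
}
```

Adding or removing a production machine then only means editing the host list in the variable definition script.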





Switching into Production. 

This is another common script that runs on both the single machine model and the multi 
machine model. It did not require many changes, apart from a general tidy-up. We did have 
to cater for new applications that required more than one symbolic link to be set, so we check 
for the existence of a links.ctl file, which contains link commands that replace the standard 
one. 
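
As a rough sketch of that check, assuming the standard case repoints a single symbolic link: the function name switch_to_prod, the link name "current" and the location of links.ctl in the PROD directory are all hypothetical.

```shell
#!/bin/sh
# Sketch of the switch into production. switch_to_prod, the "current" link
# name and the placement of links.ctl are assumptions for illustration.
switch_to_prod() {
    prod=$1        # PROD directory
    new_exec=$2    # exec directory holding the newly installed version

    if [ -f "$prod/links.ctl" ]; then
        # The application supplies its own link commands; run them instead
        # of the standard one.
        (cd "$prod" && sh ./links.ctl)
    else
        # Standard case: repoint a single symbolic link at the new version.
        rm -f "$prod/current"
        ln -s "$new_exec" "$prod/current"
    fi
}
```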

Fix Lodgements. 

These follow similar processes to the full lodgement with separate scripts being required for 
the transfer stages of the single machine model and the multi machine model. 

We also perceived a need for a script to install the fix into the current production exec 
directory, instead of using PATH to point to a separate fix area. This meant that we had to 
provide a suitable roll-back facility so that the fix could be removed if necessary. 

Customisations. 

Each of the scripts contains hooks to run predefined customisation scripts during all stages of 
the lodgement, if required. The invoking script checks for the existence of a specified script, 
and invokes it if it is present for that application. 

Currently we use the following: 

• A pre-get script to check for the existence of links in the PARK area during the 
checking process. 

• A script and C program that pick up the original permissions in the PARK area and 
turn them into chmod statements in a file, which is transferred with the lodgement 
and run after the lodgement is unpacked into the exec area. 

• Most applications use post-install and post-fix scripts to set the file permissions, 
and in some cases ownerships, after the lodgement is unpacked into the exec area. 
This gave us yet another problem with a new application as they required the 
ownerships of both files and directories to be changed, which then meant we had 
difficulty removing them prior to unpacking the new version. We had to arrange 
for sudo to be set up to get around this problem. 

• A post switch script can also be used to run any database updates, or other such 
things that are required as part of the installation of a new version. 

Most applications only have three or four custom scripts, one of which is their version of the 
master variable definition script. This means that we can keep the customisation processes 
segregated from the common scripts which, in turn, makes setting up a lodgement system for 
a new application extremely easy. 
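
The hook mechanism described in this section can be sketched in a few lines. The function name run_hook and the idea of keeping one file per stage in a hook directory are assumptions for the example; only the stage names come from the text above.

```shell
#!/bin/sh
# Sketch of an optional customisation hook. run_hook and the directory
# layout are hypothetical.
run_hook() {
    hook_dir=$1    # directory holding this application's custom scripts
    stage=$2       # e.g. pre-get, post-install, post-fix, post-switch
    shift 2

    hook="$hook_dir/$stage"
    if [ -f "$hook" ]; then
        # Run the custom script; if it fails, run_hook fails too.
        sh "$hook" "$@"
    else
        # Hooks are optional, so a missing script is not an error.
        return 0
    fi
}
```

A common script would call, say, run_hook "$HOOKS" post-install at the right point and carry on unchanged for applications that supply no such script.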


The Current Status 

The revised lodgement system has been installed in all of our existing sites and we have 
added four more applications to our stable. Two of these new applications have added extra 
complexity to the system in that they actually run two separate lodgement systems under the 
same umbrella. We solved this by putting them in separate directories under LODGE and 
PROD, and the PARK areas are also separate. The .profile of the lodgement functional user 
contains a question to determine which lodgement is to be run. 





We currently have five versions, 2.0, 2.01, 2.02, 2.03, and 2.04, each containing minor 
enhancements to solve various environmental issues: the .netrc issue, the sudo issue, some 
minor changes required for the port to the SUN platform (all Unixes are equal? Yeh, sure!) 
and some complicated middleware shutdown and startup requirements. 

We have put the incremental lodgement requirements on hold as none of our users 
desperately needs them yet, but this will be required at a later stage if we are to use the system 
as a distribution tool for the SCM package. 

We have been working on combining all the versions, along with some minor bug fixes, and 
an enhancement to remove the direct operator intervention and allow the scripts to be run 
from Control-M, the scheduling tool mentioned earlier. Testing on V2.1 is almost complete 
and we will be rolling this out over the next couple of months. 

We have also completed evaluating two major SCM packages and hope to be starting a pilot 
project on the selected package very soon. 

And someone, somehow, snuck in a name change for our team. We are now Client Server 
Lodgements. I think this means they are planning to toss NT in our direction as well... 

Thanks, Westpac. 

We’d like to thank the Host and Midrange Operations Group at the Westpac Banking 
Corporation for permitting this paper to be written, and allowing us to share some of our 
techniques and ideas in true Unix user group fashion. 




Experience Using CVS for Long-Running Projects 

or 

Lost in a Monkey-Puzzle Tree 

Peter Chubb 
Softway Pty Ltd 

16 September 1998 


Abstract 

CVS, the Concurrent Versions System, is a wonderful 
freely available tool for helping to manage bodies 
of electronic data where changes are being made 
by different groups at the same time. 

Softway uses CVS internally for managing 
several products that involve modifying existing 
UNIX source code. Because of licensing and other 
restrictions, we have to split the repositories over 
separate machines, but maintain some part of the 
repository in common. CVS allows one to tag a 
file — to assign a label to a particular revision of 
that file. Some tags indicate a branch — a whole 
set of revisions derived from the tagged revision. 

After a project has been running for several 
years, it accumulates enough branches and tags 
to resemble a monkey-puzzle tree. Manual or 
semi-automatic procedures are needed to determine 
what tags are what. 

CVS is not enough, on its own, to perform the 
kinds of change management we need. We have 
added to it various manual operations, and used the 
RCS state field to track release status. 

After beginning to create tools for better managing 
branches and tags, we discovered the cvslines 
package, which does a better job. Moreover, recent 
versions of CVS provide new features that aid in 
controlling source. 


1 Introduction 

Softway’s main products involve making changes 
to other people’s source code, and delivering those 
changes to the original owners. At the same time, 
we develop code that links to the modified source 
code, and runs with it. 

So for example, a UNIX vendor licenses to us 
a copy of their UNIX source, which we can keep 
on one of their machines. We modify it to insert 
our hooks, then return patches to them for integration. 
As their development process and ours proceed 
concurrently, we exchange patches and snapshots 
at irregular intervals, which we import into 
the CVS repository. Our small differences from 
their source tree eventually disappear, as they finish 
integrating the changes, and as their release 
date approaches. 

CVS was originally designed with this scenario 
in mind [3], and it does it very well. CVS has the 
concept of a ‘vendor source’ branch built in to it, 
and automatically propagates changes made on the 
vendor source branch to all locally unmodified files 
on the mainline. 

The source code for our kernel modules must be 
compiled in the same environment (same compiler, 
same compiler flags, same include files, etc.) as 
the Unix kernel. The easiest way to achieve this 
is to graft our product source as a directory in the 
Unix source, and to modify makefiles accordingly. 





However, this either means that we need to replicate 
the repository for our product or use NFS and 
symlinks to allow the repository for our source to 
reside on a different machine from the repository 
for the kernel source. Some manufacturers’ NFS 
products don’t interwork with that of others as well 
as they might — and we’ve found this to be slow. 

CVS on its own is less ideally suited to managing 
multiple versions of the same product. It 
requires one to apply the same changes multiple 
times on different branches. Typically we want two 
kinds of branches: ones that totally isolate development 
(for releases or way-out experimentation) 
and ones that isolate only partially. Under vanilla 
CVS, only the vendor branch is automatically applied 
to the mainline; other branches cannot be set 
up this way. 

One other feature of our development environment 
is that individuals’ work areas are not backed 
up (to save time writing the tapes). All CVS repositories 
are backed up nightly. This has the side 
effect of encouraging developers to check in their 
work often, sometimes even in an uncompilable 
state. 

The usual modus operandi under CVS is for 
each developer to check out a copy of the source 
tree, make changes as desired, compiling and testing 
on the way, and checking changes back into the 
master repository at regular intervals. CVS refuses 
to commit changes to a file if the base version of 
that file is not current — a developer must perform 
a cvs update before a checkin. 

In recent versions (post CVS 1.9) there is an alternate 
mode (familiar to users of RCS or SCCS): 
when a source file is checked out it is read-only; 
a specific command has to be given to get a read-write 
copy, and at most one developer can have an 
editable copy at a time. 

There are a number of different ways we’ve tried 
to use CVS: 

1. All development on the trunk, create a branch 
for releases. This is great except that when 
the product is close to release for one operating 
system, and under active development 
for another, changes to explore behaviours in 
the development side of things tend to destabilise 
the system for the almost-at-release side 
of things. It has the advantage, though, of a 
uniform source base so that improvements in 
the general algorithm, and new features, are 
available to all systems immediately. 

2. Each developer has a branch; changes are 
merged into the trunk when things are stable. 
The advantage is that each developer has a 
stable source base to work from, and merging 
can happen at controlled times. The disadvantages 
are 

• that merging can be tedious, 

• that developers cannot share a checked-out 
source tree (important when the 
sources take several gigabytes), 

• that it does not scale very well to dozens 
of developers. 

3. Each ‘line-of-development’ has a separate 
branch; multiple developers can work on 
each branch. The advantages are that, with 
care, developers can share a checked-out 
source tree; almost stable releases can import 
only the changes desired from other 
branches (avoiding the destabilising effects 
noted above); the main disadvantages are that 
(what with creating branches off branches 
for releases and experimental purposes) one 
quickly ends up with a very complex branch 
structure, and far too many tagged variants of 
each file. 

Of these, when combined with the ‘cvslines’ 
product, method 3 is the preferred one. ‘Cvslines’ 
is a tool for using CVS with multiple lines of development. 
It avoids creating multiple identical revisions 
on different branches: if a change is common 
to all branches, and the file it is applied to has 
not changed since the branch, it operates by moving 
the base-point for the branch. 





2 Source Licences 


UNIX source licences are typically for a single 
machine, usually one owned by the vendor from 
whom the licence was obtained. Therefore we 
maintain multiple CVSROOT directories, one per 
vendor, on different machines. These are not set up 
to run client/server (so parts of the unix source can 
be checked out only on the machine it is licensed 
for). Our own sources are on a separate machine 
that is configured to allow connections from CVS 
clients in our local network. 

That’s fine except that we want a common repository 
for our layered products. Only the most recent 
versions of CVS (after about 1.9.11) allow 
this; before they were available, we had to write 
scripts to invoke CVS with different values for 
CVSROOT in different directories, or do the same 
thing (but manually). 

One could put a symbolic link inside the repository 
to a module inside another repository, but this 
has its own problems: the administrative files inside 
CVSROOT then need to be replicated to multiple 
repositories. 


3 Document States 

In our development process, each file passes 
through a number of states. Whenever a file is 
changed and checked in, it is in ‘experimental’ 
state (see figure 1). When a developer thinks that 
the changes s/he’s been making are ready (they 
compile, pass ‘lint’, match the design, etc.) s/he 
changes the RCS state of the file(s) concerned 
with a cvs admin -sDesk command. This indicates 
that the file(s) are ready for desk-check. 
When the desk checker has scanned the change, 
either more changes are needed (in which case the 
developer has to make more changes, changing 
the state of the file back to ‘experimental’) or the 
changes are OK, in which case the desk checker 
changes the state to ‘Review’. 

Figure 1: States a file passes through on its way to 
release 

The project leader at regular intervals schedules 
document reviews of all files in state ‘Review’. 
Usually some changes have to be made as a result 
of the review: in the best case these are minor, so 
the state goes from ‘Review’ to ‘Pre-Release’, then 
when the changes have been checked, to ‘Release’. 
If the changes are more major, the state goes back 
to ‘experimental’. 

Cron jobs can scan the various CVS repositories 
looking for files in the different states, and generate 
a posting to an internal newsgroup, so that the 
project manager (and anyone else interested) can 
track the progress of the project as a whole. 
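
A scan of this kind can be sketched by filtering the output of cvs log, which reports a "state:" field for each revision. The sketch below is not our cron job: the function name states_from_log is invented, and it assumes the log output follows the usual RCS layout (a "Working file:" line per file, then "date: ...; author: ...; state: ...;" lines, newest revision first).

```shell
#!/bin/sh
# Reads `cvs log` output on stdin and prints "state file" for the first
# revision listed for each file. states_from_log is a hypothetical name and
# the parsing assumes RCS-style log output.
states_from_log() {
    awk '
        # Remember the file name; "Working file: " is 14 characters long.
        /^Working file: / { file = substr($0, 15); seen = 0 }

        # Only the first date line per file is examined.
        /^date: / && !seen {
            n = split($0, parts, ";")
            for (i = 1; i <= n; i++) {
                if (parts[i] ~ /state: /) {
                    sub(/^.*state: /, "", parts[i])
                    print parts[i], file
                }
            }
            seen = 1
        }
    '
}
```

From a checked-out tree this could be used as cvs log | states_from_log, with the result filtered by state and posted to the newsgroup.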

In addition (because standard CVS commands 
cannot use the state as a key), reviewed files are 
tagged with a distinctive tag. This can then be used 
to extract all files reviewed (and ready for test) to 
build ready for testing. 

We used to lock the RCS master file revision that 
had been reviewed and checked (effectively preventing 
accidental modification). Since CVS 1.9, 
we can set a ‘watch’ on reviewed and checked files, 
which has the effect of: 

• Making all future checkouts read-only. 

• Requiring explicit ‘cvs edit’ commands to enable 
editing a file. 

• Invoking the ‘notify’ script in the CVSROOT 
area every time a cvs edit or cvs commit is 
performed on the file. 

We set the notify script to send mail to a mailing 
list and a newsgroup, so that changes post-review 
can be caught, and trigger rechecking. 

4 Release Management 

Typically, a product goes through many releases. 
Any release that goes outside of the developer’s 
group has to satisfy certain requirements: 

1. All files making up the release are reviewed 
and checked in. 

2. All files in the release are tagged with a distinctive 
tag, and the tag added to a flat-file 
database of tags and their meanings. 

3. A clean copy of the source is checked out using 
the tag, and compiled. 

4. Generated files (formatted documents, software 
packages, etc) are inspected and tested. 
Test reports and significant documents are 
checked in and tagged with the same tag. 

5. If a review of the test reports indicates the 
product should not be released in its current 
form, the cycle restarts. 

6. Any generated files are checked into CVS 
(CVS handles binary files) and tagged with 
the release tag. 

Doing all this is necessarily a mostly manual 
procedure. CVS aids in ensuring that all the files 
have actually been reviewed (we can look at the 
‘state’ field), and tracks which revisions of which 
files make up a release. 


4.1 Handling a bug report 

CVS is not coupled to any bug-tracking system, 
and provides no explicit support for associating 
new revisions with a particular set of problem reports. 

When the first bug report comes for a released 
version, a branch is created for that version (if none 
exists as yet). This involves yet another tag on every 
file in the repository. Otherwise, a new source 
tree is created using that branch tag. 

The developer then attempts to recreate the bug. 
If there really is a bug, then the fix has to be applied 
not only to the current branch (the one where the 
bug was reported) but potentially to other branches 
as well. 

Unfortunately CVS gives little support for this. 
We keep in the bug report record a list of 
files/revisions the bug is in, and files/revisions 
the bug is fixed in, and so can merge (manually) the 
change into other branches (the bug record keeps 
track of which branches the fix has to be applied 
to). 

A product I’ve only recently discovered, cvslines, 
provides some of the missing support. It 
works on different ‘lines’ of development (each 
such line corresponds to a CVS branch). When 
a change is checked in on a branch, cvslines asks 
which lines of development the change is for, and 
works out what to do to apply the change to each 
line of development. 

In the simplest case, the change can be applied 
to the main trunk, and the bases of the various 
branches can be moved (see figure 2). For example, 
a branch that has no revisions on it (branch 
revision 1.2.0) can merely be changed to branch 
revision 1.3.0 after the change has been applied to 
the main trunk. 

In more complex cases, the change has to be 
merged into all the branches (figure 3). Cvslines 
can create a set of merged files to examine, modify 
and check in. For example in the figure, the same 
modification is applied to the main trunk and to the 
branch, resulting in two new revisions for the same 


Figure 2: Moving the base of a branch 

After ten years of development, the number of tags 
and branches in the repository starts to resemble a 
monkey-puzzle tree. 

CVS tags in our source repository are of several 
kinds: 

• Temporary tags can be used by individual 
developers for tracking pre and post 
change revisions (for handling a bug, or 
while merging lines of development). Temporary 
tags usually contain the developer’s 
name, a reason and a date, e.g., 
peterc_premerge_19980725, but we 
do not enforce this. 

• Release tags are used to indicate revisions 
actually delivered to customers. During the 
hook-insertion process, several dozen such 
tags can be inserted, making the output of 
cvs status extremely hard to read. Release 
tags are usually constructed from a 
product name (or an abbreviation for a product 
name), a target operating-system name, 
and a date, e.g., shr_irix65_19980622. 

• Snapshot tags are for snapshots of source 
delivered back to the owner, or received 
from the owner of that source. Such tags 
are of the form TO_SGI_19970829 or 
FROM_SGI_19970604. 

• Branch tags indicate lines of development, 
and generally are constructed from 
a product name, a version number, and 
a target operating-system name, e.g., 
Share_IRIX65_rel2_0. 

• Tags that were created before this naming 
scheme was developed (far too many of 
them). 

In addition, significant tags are recorded in a file 
maintained on the main branch of the source tree 
(called taglist). 

Recent versions of CVS allow tag formats to 
be enforced (by putting templates in the ‘taginfo’ 
file), but we have not used this. 
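
As an illustration of what such a check might look like, the sketch below classifies a tag name against the forms listed above. It is not something we run: the function name classify_tag is invented and the patterns are inferred only from the example tags in this section. Temporary and release tags have the same shape (two words and a date), so the sketch reports both as "dated". As we understand it, CVS passes the tag name as the first argument to a program named in ‘taginfo’ and abandons the tag if the program exits non-zero, which is the behaviour the return status here is meant to fit.

```shell
#!/bin/sh
# Classify a tag name by its form. classify_tag and these patterns are
# hypothetical, derived from the example tags only.
classify_tag() {
    d8='[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'   # an 8-digit date
    case $1 in
        TO_*_$d8 | FROM_*_$d8)
            echo snapshot ;;     # e.g. TO_SGI_19970829
        *_*_rel[0-9]*_[0-9]*)
            echo branch ;;       # e.g. Share_IRIX65_rel2_0
        *_*_$d8)
            echo dated ;;        # temporary or release tag
        *)
            echo unknown
            return 1 ;;          # non-zero status: tag does not fit the scheme
    esac
}
```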

The point is that without some policy the number 
of tags and the forms of them quickly get out 
of control, as used to happen. 

6 The Ideal Version Control Tool 

The three major deficiencies in modern software 
change management systems identified by [7] are: 

1. Lack of ambiguity tolerance. This includes 
poor support for treating several items at 
once, e.g., in applying a change to several 
lines of development at once (the ‘cvslines’ 
problem); lack of support for identifying and 
manipulating permanent variants (the ‘ifdef’ 
problem); and consistency checking in ambiguous 
configurations. 

2. Lack of process flexibility. Many configuration 
management systems enforce a particular 
software change management methodology 
(e.g., Aegis will not allow checkins that 
do not pass a test suite). The ideal version 
management tool will, on the other hand, be 
adaptable to any organisation’s change management 
philosophy. 

3. Lack of system integration. I’m not sure what 
the authors mean by this. 

From our point of view, the first is the most 
important. CVS provides no help for permanent 
variants; that is left to CPP. And we always have 
to consider, when checking in a change, which 
branches it should be applied to. 

While CVS is not ideal, it has two major advantages 
over any other: 

1. It comes with source, so local modifications 
are possible, and 

2. It does most of what we need. 

For example, our ShareII kernel module has 
large chunks that are common for all the operating 
systems it has to link to. However, different 
locking policies in different kernels, names of key 
kernel variables, and significant architectural differences 
between kernels (significant for ShareII, 
that is; many are not visible from user-space) 
mean that the code can very easily become a maze 
of #ifdef statements that is very hard to read or 
modify. Unfortunately, one does not wish to create 
a branch at the CVS level for feature differences 
when the algorithm being used is common to all 
operating systems, and it’s only, say, that some machines 
have atomic operations on 64-bit quantities 
available in the kernel, and some other machines 
do not, and so require the addition of explicit locking; 
or in some machines lock A must be acquired 
before lock B, and on a different machine, lock B 
must be acquired before lock A, and (because the 
lock implementation requires it) must be at SPLHI. 
At present, we live with too many ifdefs. 

The alternatives are: 

1. Use lots of ifdefs. 

2. Use separate source code files for each vari¬ 
ant, and rely on applying changes to all of 
them when common bugs are found. 

3. Assume the lowest common denominator, if 
it can be determined. 

None of these is entirely satisfactory. We’ve 
chosen the ‘lots-of-ifdefs’ approach at present; that 
may change. 

Again, using our ShareII kernel module as an 
example, each release eventually ends up on a separate 
branch (as described above). When a change 
is made to fix a bug in one branch, it may be applicable 
to several of the branches. So the same fix 
has to be applied more than once (usually by cut-and-paste, 
with the corresponding problems that 
has). Our bug-tracking system has provision for 
this, but it would be much nicer if the source 
control system could provide support for applying 
changes to multiple lines of development at once. 
I’d begun to write scripts for this, when I discovered 
the product ‘cvslines’ which seems mostly to 
solve the problem, and opens new ways of using 
CVS. 

The ideal version control tool would have the 
flexibility to suit many purposes at once. It would 
allow easy distinction between revisions in different 
states; would allow controlled interaction 
between developers; would allow different views 
of the source for different purposes (for example, 
when creating a source release, one wants a view of 
the source that includes only those bits that are to 
be released to the customer the source is built for; 
preprocessing with CPP does too much. The 
BSD tool, unifdef, goes a long way in the right 
direction, but not all files are C-like enough to cope 
with #ifdef); would be sufficiently coupled to 
the software development methodology being used 
to allow document state to be tracked and controlled, 
but sufficiently general that it can be used 
with many different software engineering methodologies. 
It would be low cost (both in licence fees 
and in machine resources to run it), come with 
source, allow a unified view of multiple repositories, 
and scale well to many developers. 

7 Conclusions 

We’ve used CVS now since version 1.3, having migrated 
from SCCS [5] through RCS [6], a home-grown 
system called ‘stools’, and finally CVS. 
We’ve also evaluated commercial tools such as 
ClearCase [1] and Perforce [2]. 

All these have different drawbacks. The best 
we’ve found is CVS, used as described above. 
Adding ‘cvslines’ may improve the situation yet 
more, but I only came across that while preparing 
this paper, so we haven’t tried it very much yet. 


References 

[1] Atria ClearCase home page. URL: 
http://www.rational.com/products/ccmbu/. 

[2] Perforce home page. URL: 
http://www.perforce.com/. 

[3] Brian Berliner. CVS II: Parallelizing software 
development. In Usenix Winter Conference. 
USENIX Association, Berkeley, California, 
1990. 

[4] CVS home page. URL: 
http://www.cyclic.com/cvs/info.html. 

[5] M. J. Rochkind. The source code control system. 
IEEE Transactions on Software Engineering, 
SE-1(4):364-370, December 1975. 

[6] Walter F. Tichy. RCS - a system for version 
control. Software Practice and Experience, 
15(7):637-654, July 1985. 

[7] Andreas Zeller and Gregor Snelting. Unified 
versioning through feature logic. ACM Transactions 
on Software Engineering and Methodology, 
6(4):398-441, October 1997. 





Autoinstall: Automating Platform Installation 

or How to install thirty complex servers before morning tea 

Gordon Rowell 
Gormand Pty Ltd 
Gordon.Rowell@gormand.com.au 

Peter Bray 

Telstra Corporation Limited 
Peter.Bray@ind.tansu.com.au 

Introduction 

AutoInstall is an automated installation system for UNIX hosts. It provides hands-off, reproducible 
installation of software, patches and individual host configuration for networks of machines. 

AutoInstall was originally developed to automate the installation of Telstra Intelligent Network 
sites, which comprise as many as thirty hosts, each with specific hardware, operating system and 
software configuration. AutoInstall can install an entire site of thirty machines in under two hours, 
completely unattended. The general nature of AutoInstall allows automated, reproducible 
installation of any type of system. 

Motivation 


Many sites use installation scripts to simplify installation. The installation scripts are normally 
written to cover a particular installation, and only that installation. Most installation scripts lay 
down implementation details in code, and this code includes file references which are usually 
tightly integrated into the scripts. For example, the location of the software repository is likely to be 
coded into the various scripts which handle software installation. 


New installations require new scripts, and each of these would again encode the various fixed 
locations required for the installation to proceed. Separation of the important information into 
configuration files helps, but these locations are still visible to the script developer as they are 
scattered through the code. 


AutoInstall uses high-level directives to separate the definition of the function required from the
implementation of that function. These directives are independent of the flavour of Unix being
installed and AutoInstall understands the concepts of installing software on a local host, installing
software on a mounted client system, and testing the configuration on an installation server prior to
installation on a live host.


AutoInstall reduces the requirement for installation documentation as the directives define the steps
required and the audit logs confirm that all steps were completed. The logging subsystem collects
the output from each of the directives and commands. It also keeps track of any command or
directive failures and notes these in the summary log, which will be empty for a successful
installation.

In a production environment it is often faster to replace a faulty system disk and AutoInstall the
host than it would be to call a trained systems administrator to diagnose the problem and suggest a
solution. The faulty disks can be removed for analysis while the host is returned to service.

Directive Files 


AutoInstall parses the hierarchy of directive files on startup, and logs errors for any unimplemented
or incorrectly used directives. The AutoInstall framework allows extra directives to be easily added.
This allows common functions to be abstracted into new directives. These can be added to the
AutoInstall base, or can be a project-specific extension.


AutoInstall directive files use location-independent references to files. AutoInstall determines the
correct file by applying a search path, which starts with information about a particular host,
progresses through the operating system and the operating system family, and ends with common
configuration. The use of high-level directive files and the implied search path allows an installer to
concentrate on what needs to be installed, not how this is performed for a particular operating
system.


The following example is a simplified AutoInstall directive file for Oracle database servers. The
configuration avoids references to individual hostnames or sites through the use of AutoInstall
variables. The implementation of the referenced directive files is not shown here as they are not of
concern to the developer of this directive file. This separation is a major benefit of the modular
approach taken.


# Section 1: Host configuration
require         OSPatches
require         DNSClient
suppress        Software/pine
require         StandardUtilities
append          /etc/hosts--${GenericConfig::Hostname}
replace         /etc/hosts.allow--${GenericConfig::Hostname} 0444 root root
edit            /etc/inittab--ConsoleXterms

# Section 2: Account management
rootpasswd      root-${GenericConfig::Site}
addaccounts     passwd-${GenericConfig::Site} group-${GenericConfig::Site}

# Section 3: Disk configuration
require         DiskSuite-4.1
reboot
partition       c1t[0-5]d[0-4] OnePartition
replace         /etc/opt/SUNWmd/md.conf--OneBigSSA 0444 root root
makefilesystem  /dev/md/rdsk/d40 /export/md/d40 1775 root root
loopback        /export/md/d40/bigdb /bigdb

# Section 4: Oracle and database creation
require         Oracle-7.3.2
shellscript     CreateDB /bigdb


In section one the operating system patches for this particular operating system are installed. There
is no reference to the specific set of patches, as the AutoInstall search path determines the correct
OSPatches file to use. The machine is then configured as a DNSClient and standard utilities are
installed, except for pine, which has been explicitly suppressed. The /etc/hosts and
/etc/hosts.allow files are configured with information specific to the host being installed
through the use of the ${GenericConfig::Hostname} variable. The final directive in this section
changes the console terminal type to xterms.


In section two various accounts are created, and the root password is set. All of the referenced files
use the ${GenericConfig::Site} variable so that separate accounts and root passwords are
created for each site.

Section three provides configuration of the disks on the host. This section of the AutoInstall
configuration is currently operating system specific as DiskSuite-4.1 is only supported on Solaris.
The DiskSuite-4.1 directive file installs DiskSuite and relevant patches, and increases the number of
metadevices available. The host is then rebooted, which allows DiskSuite metadevices to be used in
further directives. AutoInstall continues automatically with the directive following the reboot.

The next directive partitions all of the disks in a SPARCStorageArray on controller one using a 
partition file called OnePartition. The next three directives provide configuration of DiskSuite to 
create a single huge filesystem mounted on /bigdb. 

The final section installs Oracle and then calls an external shell script to create a database under
/bigdb. The Oracle-7.3.2 directive file installs Oracle and its startup files, adds the appropriate
shared memory configuration options to the kernel and reboots the host.

Search Paths 


The use of the search path for all file references becomes very powerful when applied to directive 
files. For example, many projects might require database configuration, but each project might have 
specific changes to that configuration. The directive file Database-Client might look something 
like: 

require Database-Client-OSPatches 
require Database-Client-Software 
require Database-Client-Config 


Since the search path is applied to all file references, any or all of the directive files can be 
specialised for each project, or even each individual host. For example, the files actually used in the 
configuration might be: 

Configuration/Common/Directives/Database-Client
Configuration/SunOS/5.5.1/sparc/Directives/Database-Client-OSPatches
Configuration/SunOS/5.X/Directives/Database-Client-Software
Projects/MyProject/Configuration/Common/Directives/Database-Client-Config

If the hierarchy is designed in a modular and reusable way, the project-specific trees become quite
small and provide very accurate documentation on the specifics of the project. Moving to SunOS
5.6 might simply require the creation of the directive file
Configuration/SunOS/5.6/sparc/Directives/Database-Client-OSPatches. Note that this
has not changed support for earlier versions such as 5.5.1, just added support for 5.6.

As an example of the power of search paths, a platform under construction might need a utility like
sysinfo, which on SunOS 4.x requires a version for each kernel architecture, but on Solaris has a
single version for all kernel architectures. The following directive could be used to install the
correct version of sysinfo-3.3:

tarpackage /pkgs/sysinfo-3.3 sysinfo-3.3 


The directive finds the appropriate version of sysinfo-3.3.tar (optionally compressed or
gzipped) in the AutoInstall hierarchy and installs it on the client into /pkgs/sysinfo-3.3.

The search path is constructed from information about the host to be installed. A (greatly 
simplified) search path is shown below, and is applied from most-specific (Hostname) to 
least-specific (Common - used when no other match has been found): 


Hostname              (e.g. myhost.some.dom)
Domainname            (e.g. some.dom)
Kernel Architecture   (e.g. sun4u)
Processor             (e.g. sparc)
OS Version            (e.g. SunOS/5.5.1)
OS Family             (e.g. SunOS/5.X)
OS Common             (e.g. SunOS/Common)
Common                (e.g. Common)


The search path can be significantly more complex than this as AutoInstall supports projects and
sub-projects, each of which could provide hierarchies similar to the above. The important point is
that the search path is applied to all file references, allowing complex hierarchies to be developed
without changing the directives.
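
The most-specific-first lookup described above can be sketched in Python. This is only an
illustration, not AutoInstall's own Perl code: the function names, the host dictionary and the
exact directory layout (for example, kernel architecture and processor directories nested under
the OS version) are assumptions made for the example.

```python
def build_search_path(host):
    """Return candidate directories, most specific first (assumed layout)."""
    os_name = host["os"]
    os_version = f"{os_name}/{host['os_version']}"
    return [
        host["hostname"],                      # e.g. myhost.some.dom
        host["domain"],                        # e.g. some.dom
        f"{os_version}/{host['kernel_arch']}",  # e.g. SunOS/5.5.1/sun4u
        f"{os_version}/{host['processor']}",    # e.g. SunOS/5.5.1/sparc
        os_version,                            # e.g. SunOS/5.5.1
        f"{os_name}/{host['os_family']}",      # e.g. SunOS/5.X
        f"{os_name}/Common",
        "Common",
    ]


def resolve(category, name, host, available):
    """Return the first existing path for a location-independent reference.

    `available` is a set of paths standing in for the files held on the
    installation server, so the sketch runs without touching a filesystem.
    """
    for directory in build_search_path(host):
        candidate = f"{directory}/{category}/{name}"
        if candidate in available:
            return candidate
    return None  # nothing matched: the directive would fail


# Hypothetical host description and server contents.
host = {
    "hostname": "myhost.some.dom", "domain": "some.dom",
    "os": "SunOS", "os_version": "5.5.1", "os_family": "5.X",
    "kernel_arch": "sun4u", "processor": "sparc",
}
available = {
    "Common/Directives/Database-Client",
    "SunOS/5.5.1/sparc/Directives/Database-Client-OSPatches",
    "SunOS/5.X/Directives/Database-Client-Software",
}

print(resolve("Directives", "Database-Client-OSPatches", host, available))
print(resolve("Directives", "Database-Client", host, available))
```

Because every reference goes through the same lookup, adding a more specific file (for example
under the hostname directory) overrides the common one without any change to the directive.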


Since sysinfo is architecture and operating system specific, the developer would need to place the 
appropriate tar archive in the relevant directory, as shown below: 


OS                         Pathname
SunOS 4.1.3_U1 (sun4m)     SunOS/4.1.3_U1/sun4m/TarPackages/sysinfo-3.3.tar.gz
SunOS 4.1.3_U1 (sun4c)     SunOS/4.1.3_U1/sun4c/TarPackages/sysinfo-3.3.tar.gz
SunOS 5.5.1 (SPARC)        SunOS/5.5.1/sparc/TarPackages/sysinfo-3.3.tar.gz
SunOS 5.5.1 (Intel)        SunOS/5.5.1/x86/TarPackages/sysinfo-3.3.tar.gz


Note that the directive would fail if this configuration were used or tested for a sun4d machine
running SunOS 4.1.3_U1 as no tar file exists for the sun4d kernel architecture. AutoInstall does not
remove the need to compile each relevant version, but greatly reduces the effort required to install
the software on many differing systems. All that is required to make this configuration work for
sun4d machines is to place the appropriate tar file in the relevant directory.

File Naming Conventions 


All data for AutoInstall is held in standard directory trees on the installation server. It was decided
that introducing directory trees to mimic the client directory structure would be inconvenient and
create large numbers of directories containing a single file, and so a mapping was introduced to
flatten the hierarchy. Leading slashes in filenames are deleted, and all other slashes are converted to
underscores. There are also requirements to make multiple changes to the one file in the client's
filesystem, and this is supported through the use of filename comments. Filename comments
also enhance the readability of the directive files, as shown in the following example.

Directive File    Directive                        File to search for
TerminalServer    append /etc/services--erpcd      etc_services--erpcd
Bootpd            append /etc/services--bootpd     etc_services--bootpd
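
The flattening rule described above (drop leading slashes, turn the remaining slashes into
underscores) is simple enough to sketch in a few lines of Python. The function name is
hypothetical, and the sketch assumes that a filename comment (the part after "--") contains no
slashes of its own.

```python
def flatten_reference(reference):
    """Map a client path, with optional --comment, to a flat server filename."""
    return reference.lstrip("/").replace("/", "_")


for ref in ("/etc/services--erpcd", "/etc/services--bootpd"):
    print(ref, "->", flatten_reference(ref))
```

The two references change the same client file, /etc/services, yet map to different files on the
installation server because their comments differ.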


User extensibility 


AutoInstall is written in Perl and provides several major hooks for users to extend or customise
its operation. Each project or sub-project can have a module called LocalModules.pm, which can
do simple configuration such as defining sites and variables, or modifying the search path. For the
more adventurous it can add or replace directives or provide additional Preinstall and Postinstall
configuration.

We have attempted to keep site-specific code out of AutoInstall, and promote the use of
LocalModules.pm. In our own case, we define our host naming convention, and add some
site-specific directives. For example, we have added tests to allow a system to be reinstalled while
preserving the database disks. These tests replace standard directives with directives which know
how to preserve the data on the database disks while allowing reinstallation of the system disks.

Testing and Quality Assurance 

AutoInstall contains extensive support for testing and debugging directive files. Because all the
code uses a common testing and debugging framework, it is possible to testBuild a configuration
on the installation server. Directives understand that they are being called in a testing mode, and do
not modify any files. By setting environment variables it is possible to test installation of machines
which have a different operating system or machine architecture from the installation server.

For example, 

TARGET_OS=SunOS TARGET_OSREV=4.1.3_U1 testBuild mySunOSbox.some.dom 
TARGET_PROCESSOR=x86 TARGET_OSREV=5.6 testBuild mySolarisPC.some.dom 
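
The idea behind the testing mode can be sketched in Python. This is not how AutoInstall itself is
written (it is a Perl system); the function names, the log format and the fallback to the local
platform are assumptions for illustration. Only the TARGET_OS, TARGET_OSREV and
TARGET_PROCESSOR variable names come from the example above.

```python
import os
import platform


def target_platform(env=os.environ):
    """Describe the platform to build for; TARGET_* variables override
    whatever is detected on the installation server itself."""
    return {
        "os": env.get("TARGET_OS", platform.system()),
        "osrev": env.get("TARGET_OSREV", platform.release()),
        "processor": env.get("TARGET_PROCESSOR", platform.machine()),
    }


def append_directive(path, text, test_mode, log):
    """Append text to a file, or only record the intent when testing."""
    if test_mode:
        log.append(f"TEST: would append {len(text)} characters to {path}")
        return
    with open(path, "a") as handle:
        handle.write(text)
    log.append(f"appended {len(text)} characters to {path}")


log = []
target = target_platform({"TARGET_OS": "SunOS", "TARGET_OSREV": "4.1.3_U1"})
# Placeholder content; in test mode nothing is written to the file.
append_directive("/etc/services", "example 9999/tcp\n", test_mode=True, log=log)
print(target["os"], target["osrev"], log)
```

Routing every file-changing operation through a single test-mode check is what lets the same
directive files serve both the dry run and the real installation.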


The ability to test the configuration without requiring additional hardware allows a separation
between installation developers and system installers. Once the installation is fully tested the final
hardware can be assembled and installed without intervention. AutoInstall directive files are small
text files, and so can be easily versioned and compared. Full audit logs are left on the installed
platform and can be mailed to a central site as part of the install.

AutoInstall has numerous benefits when it comes to system testing. System installs prior to
AutoInstall required a 300+ page installation manual, and took at least a day per machine. System
installs for Sun SPARC machines now require the installer only to type boot net - install and to
check the installation log once the install is finished, approximately one hour later. A typical machine
has over 350 directives applied to it after the operating system has been installed, and each of these
directives embodies many lower level concepts. For example, the graft directive creates a
directory, unrolls a tarpackage, changes the installed permissions and then calls the graft command.

Complete installation testing is now a possibility. System testers can install the machine afresh after 
all of the developer "fixes" have been applied and tested to ensure that the delivered system is the 
one that was tested, and that all such "fixes" make it into a formal release. Machines can be 
identically configured at all sites, including the testbed, so that problems found at one site can be 
checked and fixed at other sites with certainty. 

It is also possible to have reconfigurable testbeds. If a common hardware base is chosen it is 
feasible to rebuild the testbed for each project as it goes into system testing, and switch between 
testbed environments through a simple hands-off install. This is particularly useful for retrofitting 
patches to earlier releases as the testbed can be returned to the state it was in when the application 
was released. 


Designing for reuse 

Many of the tasks faced by systems administrators are similar for each operating system - OS 
patches, tightening security, installation of standard tools and accounts. It is beneficial to spend 
time building a reusable base which describes these common tasks, so that a new operating system 
only requires the blanks to be filled, rather than the development of a complete hierarchy. 

A rich and reusable base configuration allows rapid development of configurations for new 
operating systems and one-off machines. It also enforces standards across platforms, which eases 
administration and support. If the base configuration is suitably rich, it is possible to have extremely 
small project configuration trees, which have only truly project-specific changes. 


An example of such a design might be: 


PlatformIndependentBaseline
    require VendorOS                              # Top-level wrapper
        require VendorOS-Patches                  # Patch set for particular OS
        require VendorOS-BasicSecurity            # Tighten default vendor security
        require VendorOS-AdditionalTools          # Add vendor tools not in the OS
        require VendorOS-DisableUnusedFeatures    # Remove unwanted baggage from OS

    require SystemAdministration
        require SystemAdministration-Basic        # e.g. syslog policy
        require sendmail-removal
        require qmail-software                    # Install a good MTA
        require qmail-nullclient-config           # A base - may well be overridden
        require xntp-software
        require xntp-client-config                # IP addresses project specific
        require sudo-software
        require sudo-config                       # Sudo permissions project specific

    require CoreUtilities
        require bash-software                     # A decent universal shell
        require sysinfo-software

    require UserEnvironment-Basic


To use this infrastructure, a project might have a directive file MyProject-Baseline which would
require PlatformIndependentBaseline and then perform project-specific customisations.

The configuration can be tested with testBuild which will flag any missing files or incomplete 
directives. The placement of directive files in operating system family directories (e.g. SunOS/5.X) 
allows a standard configuration for all of that family, which can then be overridden for specific 
versions if required. 

Current Status 

The original version of AutoInstall was an extended "finish" script for Solaris 2.x JumpStart. The
current version supports 35 directives and three pseudo-directives which perform all postinstall
customisation for any Unix-like operating system. The native operating system support is used to
provide a basic operating system installation, and then AutoInstall is called from an NFS mount to
provide all other configuration, in an operating system independent manner.

AutoInstall also provides support for "foreign" operating system installations. For example, SunOS
is installed by booting the host across the network under Solaris 2.x and then restoring dumps of
SunOS onto the disks. Once SunOS has been installed the same directive files are used for both
Solaris and SunOS postinstalls.

The current version of AutoInstall provides full support for Solaris (SPARC and x86) and SunOS, and
preliminary support for SCO and Linux. The AutoInstall hierarchy is installed onto an install
server, and the hosts build from that hierarchy, using a read-only NFS mount. AutoInstall has a
self-contained toolset which is compiled and used for the relevant architecture and operating system
configuration.

A recent addition to the toolset has been a custom Solaris 2.5.1 CDROM, which installs the install
server itself. The install server is installed with the customised CDROM and the AutoInstall tape,
and installs itself without intervention. The custom CDROM builds the filesystems for AutoInstall,
reads the tape, and calls AutoInstall to make any other required changes, including copying and
patching the newly installed CDROM image for future installs.

One major benefit of the use of AutoInstall is the standardisation of platform installation across
operating systems and architectures. Installing for a different architecture or operating system is a
matter of compiling the tools and packages for the target version.

AutoInstall was designed for static installation of hosts. This is both a strength, as it guarantees a
known state, and a weakness, as it does not cover the full life cycle of deployed platforms. The
installation of versioned packages and the use of graft simplify this task. The Configs package
provides a longer term view of platform maintenance, and an eventual merging of AutoInstall and
Configs would provide support for the full life cycle.

Futures 

The current version of AutoInstall has been in constant use for many projects for two years. Over
200 machines, of more than 100 types, have been installed at various sites. The problems faced are
annoying rather than show-stoppers.

The pseudo-directives require, include and suppress are currently executed during 
configuration parsing. It would be useful to have these evaluated at runtime to allow diversions, or 
to pass arguments to directive files, based on previous configuration entries. For example, it would 
be useful to be able to say: 

require PlatformSupportScripts ${LocalConfig::PSSVersion} 


where ${LocalConfig::PSSVersion} has been set at some stage in the directives hierarchy. Even
more generally, it would be useful to have default settings of variables, as in the following example
from a hypothetical PlatformSupportScripts directive file:

setvariable PSSVersion $1 ${LocalConfig::PSSVersion} 0.0.1

This would set PSSVersion to the first argument passed to this directive file, or to the value of
${LocalConfig::PSSVersion} if that is set, or to 0.0.1 as a default.
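
The proposed "first value that is set" behaviour can be sketched in Python. This models a feature
the paper only proposes, so the semantics here are an assumption: an unset variable is
represented by None or an empty string, and the function names are hypothetical.

```python
def first_set(*candidates):
    """Return the first candidate that is set (neither None nor empty)."""
    for value in candidates:
        if value:
            return value
    return None


def pss_version(args, local_config):
    """Resolve PSSVersion in the order proposed for setvariable."""
    return first_set(
        args[0] if args else None,         # $1, the first directive argument
        local_config.get("PSSVersion"),    # ${LocalConfig::PSSVersion}
        "0.0.1",                           # final default
    )


print(pss_version(["2.1"], {"PSSVersion": "1.0"}))
print(pss_version([], {"PSSVersion": "1.0"}))
print(pss_version([], {}))
```

With defaults resolved this way, a top-level directive file would only need to set a variable when
it wants something other than the default.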

The workaround with the current parser is to write directive files using variables which must be set
somewhere in the directive hierarchy or the parser will complain. This method of writing directive
files exposes too much internal detail to the end-user as top-level directive files end up as:

setvariable SomeSoftwareVersion 1.2.1 
setvariable ThisSoftwareVersion 3.5.0 
setvariable ThatSoftwareVersion 2.4.3 
require MyProject-Baseline 

The other major area of work would be the generalisation of disk, filesystem and mirroring 
configuration in an operating system independent manner. A metaconfiguration file is required 
which describes the relationship between disk slices, slice size, filesystems, mirrors and stripes. 
Much of the supporting infrastructure already exists, but work is required to provide an operating 
system independent view of these facilities. This information could also be used to generate profiles 
for the initial operating system installation, for use by JumpStart, KickStart or similar. 

Summary 

AutoInstall provides a complete set of directives for automation of system installation. The use of
high-level directives allows installers to concentrate on what they want performed, not how this is
achieved for a particular platform.

The same directive files are used for automated installation and non-destructive testing and, if 
designed correctly, for all operating systems and architectures. The standardised error reporting 
provides a consistency which is difficult to achieve in discrete installation scripts and provides this 
consistency across architectures and operating systems. The powerful search path mechanism 
allows generic functionality to be abstracted into common directive files which can then be used in 
project specific specialisation. 

Reproducible, verifiable, hands-off installation provides a major boost to the reliable installation of
a site. AutoInstall can install a network of thirty machines, including all patches, software and host
configuration, without intervention, in less time than it would take to install a single machine
manually, and more reliably.

About the authors 

Gordon Rowell is an independent systems and network consultant, specialising in UNIX systems
administration. His areas of interest are high-availability systems and reducing the administration
overhead of large networks, and he enjoys writing tools in these areas. His email address is
Gordon.Rowell@gormand.com.au

Peter Bray is an analyst/programmer specialising in infrastructure development on UNIX systems. 
His computing interests include operating systems, programming languages and object oriented 
design and analysis. He enjoys studying and specifying the architectural issues of complex systems. 
His email address is Peter.Bray@ind.tansu.com.au 

References 

Graft - Virtual Package Installer by Peter Samuel - ftp://ftp.uniq.com.au/pub/tools/graft 

Configs - Host configuration tool by Simon Gerraty - 
http://www.quick.com.au/FreeWare/configs.html 


A Current Perspective on Encryption Algorithms


Dr Lawrie Brown 

School of Computer Science, Australian Defence Force Academy, Canberra, Australia 

Email: Lawrie.Brown@adfa.edu.au 


Abstract 

This talk will present a perspective on the current state of play in the field of encryption algorithms, 
in particular on private key block ciphers which are widely used for bulk data and link encryption. I 
will initially survey some of the more popular and interesting algorithms currently in use. Then I’ll 
describe the recent call by the US NIST for submissions to define an Advanced Encryption 
Standard to eventually replace the DES. I will sketch the requirements they’ve given and briefly 
describe the candidates currently known. 


Introduction 

With growing awareness of, and concerns about, computer and communications security, there is 
increasing incorporation of cryptographic algorithms into such systems. This paper presents a brief 
overview of the current state of cryptographic algorithms, with an emphasis on private key block 
ciphers. These are widely used for applications requiring bulk data or link encryption. It has been
prompted in part by the call, currently underway, from the US NIST (National Institute of Standards
and Technology) for submissions of candidate algorithms for an Advanced Encryption Standard to
replace the existing DES.

Types of Encryption Algorithms 

Cryptographic algorithms can be broadly categorised as follows:

private (single, shared) key algorithms

are used for bulk data encryption, as they are comparatively fast. They do require a secret key
to be known by both parties. Examples include: DES, Blowfish, IDEA, LOKI, RC4.

public (two) key algorithms

are used for key validation & distribution, which is convenient because one of the two keys can be
made public, but are limited by their computational cost (and thus slow speed) compared to
private key schemes. Examples include: Diffie-Hellman, ElGamal, RSA.

signature algorithms

are used to sign & authenticate data, and are also usually public key based. Examples include:
ElGamal, RSA, DSA.

hash algorithms

are used to compress data down to a fixed size for signing. Examples include: MD5, Haval,


In practice, many systems involve a combination of these algorithms, with public key algorithms 
being used to exchange session keys and to authenticate a hash of the information, and private key 
algorithms to encrypt the data (eg. as in PGP secure email). See any good crypto text for details, eg 
Stallings [Stal98], or Schneier [Schn96]. 

Block Ciphers Past 

In this talk, I will be focusing on Block Ciphers, which are the most common form of private key
algorithms. These operate on a fixed size data block (eg 64 bits), use a single secret, shared key (eg
56, 64, or 128 bits), and generally involve multiple rounds (8-32) of some simple, non-linear
function which uses half the value as input, and whose output is XOR'd with the other half. This is
known as a Feistel structure, invented by Horst Feistel of IBM in the early 70's, and its great
advantage is its easy inversion for decryption. There are various standard modes of use which allow
a variable amount of information to be processed by the algorithm's fixed size inputs and outputs.
For encrypting bulk data, the ECB or CBC modes are used; for stream data, the CFB or OFB
modes.
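
The easy inversion of a Feistel structure can be seen in a short Python sketch. It is a toy: the
round function (built here from SHA-256 purely for convenience), the 16 made-up subkeys and the
function names are all assumptions, and it is not any of the ciphers discussed in this paper.
What it does show is that decryption is the same network run with the subkeys in reverse order,
whatever the round function happens to be.

```python
import hashlib


def round_function(half, subkey):
    """Toy non-linear F: 32 bits in, 32 bits out. Not taken from a real cipher."""
    digest = hashlib.sha256(half.to_bytes(4, "big") + subkey).digest()
    return int.from_bytes(digest[:4], "big")


def feistel_encrypt(block, subkeys):
    """Run a 64-bit block through one Feistel round per subkey."""
    left, right = block >> 32, block & 0xFFFFFFFF
    for subkey in subkeys:
        # F takes one half as input; its output is XORed into the other half.
        left, right = right, left ^ round_function(right, subkey)
    return (right << 32) | left  # swap the halves on output


def feistel_decrypt(block, subkeys):
    """Decryption is the same network with the subkeys reversed."""
    return feistel_encrypt(block, subkeys[::-1])


subkeys = [bytes([i]) * 4 for i in range(16)]  # 16 hypothetical round keys
plaintext = 0x0123456789ABCDEF
ciphertext = feistel_encrypt(plaintext, subkeys)
assert feistel_decrypt(ciphertext, subkeys) == plaintext
print(hex(ciphertext))
```

Note that round_function never has to be inverted, which is why designers are free to choose a
non-invertible F.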

The best known and most widely used block cipher is the DES, which has a 64 bit data block and a 
56 bit key. It evolved from the earlier Lucifer algorithm, developed by Feistel and colleagues at 
IBM in the early 70’s. It has been the standard algorithm for banking and other applications since 
1977. 

Any block cipher can theoretically be attacked by exhaustively trying all possible keys; whether this
is feasible depends on the size of the key. If an attack faster than exhaustive search is
possible, then the cipher is regarded as, at least theoretically, broken. Recently several brute force
recoveries of 56-bit DES keys have been demonstrated [RSAD97] [EFF98a], which graphically
illustrate the advances made in computational speeds in recent years.

Also, since the early 90's, theoretical attacks on the DES which are faster than exhaustive key
search have been developed, as discussed below. In response to this, as an interim measure, we see
the deployment of Triple-DES. This uses two (112 bit) or three (168 bit) keys and three passes of
the DES algorithm on each block (at consequently 1/3 the speed). In the longer term, the US NIST
is looking at standardising a new algorithm.
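
The key-size arithmetic is easy to check with a few lines of Python. Each DES key contributes 56
bits, so two independent keys give 112 bits of key material and three give 168. The search rate
used below is a made-up round number chosen only to show how the search time scales; it is not a
measurement of any real machine, and the function names are hypothetical.

```python
DES_KEY_BITS = 56
SECONDS_PER_DAY = 86400


def triple_des_key_bits(independent_keys):
    """Total key material when each DES key contributes 56 bits."""
    return independent_keys * DES_KEY_BITS


def keyspace(bits):
    """Number of distinct keys of the given length."""
    return 2 ** bits


def worst_case_days(bits, keys_per_second):
    """Days needed to try every key at a given (assumed) search rate."""
    return keyspace(bits) / keys_per_second / SECONDS_PER_DAY


HYPOTHETICAL_RATE = 10 ** 9  # keys per second, an assumption for illustration

for label, bits in (("DES", DES_KEY_BITS),
                    ("two-key Triple-DES", triple_des_key_bits(2)),
                    ("three-key Triple-DES", triple_des_key_bits(3))):
    print(f"{label}: {bits} bits, {worst_case_days(bits, HYPOTHETICAL_RATE):.3g} days")
```

Every additional key bit doubles the exhaustive search time, so going from 56 to 112 bits
multiplies the work by 2^56 rather than by two.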

Aside from DES, a number of other block ciphers have been proposed in the intervening years. In 
the late 80’s, several proposals were published by the academic community. The best known of 
these include: 

FEAL

a 64-bit, n-round (where n has grown from 4 to 32+) Feistel block cipher with a 64 or 128 bit
key, designed by Shimizu and Miyaguchi from NTT Japan in 1987. Implemented in both
hardware and software, it has been widely used in Japan. A number of attacks have been
developed against it, which has led to revised versions with more rounds and larger keys.

IDEA

a 64-bit, 8 round iterated block cipher with a 128-bit key, designed by X. Lai and J. Massey in
1990. It uses the concept of "mixing operations from different algebraic groups". Best known
for its use in PGP, SSH and other systems, it is one of the more widely used algorithms at the
present time, and is still believed secure. It is patented though and licensing may be needed.

LOKI91

a 64-bit, 16 round, symmetric block cipher with a 64-bit key. It was originally designed by
Brown, Pieprzyk and Seberry in 1990 [BrPS90], and later re-designed (and renamed
LOKI91), with improved security, in 1991 [BKPS91]. It is also believed secure, and is used by
several companies wanting a licence and patent free Australian algorithm.

Also around this time, Biham and Shamir announced the (re)discovery (in the public arena) of 
differential cryptanalysis - a powerful general technique for attacking iterated block ciphers 
[BiSh93]. Subsequently, Matsui announced the discovery of linear cryptanalysis [Mats93]; and 
Biham of related key attacks [Biha94c]. All of these attacks utilise knowledge of the structure of the 
cipher to accumulate information from extremely large numbers of observed encryptions to obtain 
information about the (unknown) key used. The success of these attacks depends on the number of 
observed encryptions, and decreases rapidly with increasing numbers of rounds. Eventually a 
break-even point is reached where exhaustive search is faster, and even where the number of 
observations required is greater than the total number possible. Thus these attacks are possible and 
(theoretically) faster against some ciphers, eg DES or some FEAL versions, and not against others 
(eg IDEA, LOKI91). 

In the light of this new (in the public community) knowledge, and experience with the preceding 
ciphers, a number of others were developed during the early 90’s. The better known of these 
include: 

BLOWFISH 

a 64-bit, 16 round Feistel block cipher with a variable length key, designed by B. Schneier in
1994. It is optimised for bulk data encryption (since key changes are slow), and uses four
large 8*32 bit random substitution boxes generated from the supplied key, whose outputs are
combined using addition and xor. Used in SSH and other packages, this cipher is also well
known, and is unencumbered by patent or licensing issues.

CAST 

a 64-bit, 8 round Feistel block cipher with a 64-bit key, designed by C. Adams and S. Tavares
in 1993. It uses six 8*32 bit substitution boxes (some with data in, some with subkey in)
whose outputs are combined using xor. These substitution boxes are designed for and fixed
per application. It is used in a number of products in Canada (where it was designed), and
may need to be licensed.

RC2, RC5 

two private key block ciphers developed by RSA Data Security Inc. The details of both are
proprietary, though the RC5 design has subsequently been published and subject to some
analysis. They are parameterisable, with various sized data and keys, and number of rounds,
possible. These ciphers have been widely licensed and used in products such as Netscape.
However because of their proprietary nature, public analysis has been limited (nb. RC4 is a
stream cipher by the same people, also widely used, with details released due to its reverse
engineering).

SAFER 

a 64-bit, 6 or higher round, iterated block cipher with 64 or 128 bit keys, designed by J.
Massey in 1994.

SQUARE 

a 128-bit, 8-round block cipher, designed by Joan Daemen and Vincent Rijmen in 1997, with 
emphasis on resistance against differential and linear cryptanalysis. 

TEA 

a simple 64-bit, 32 round Feistel block cipher with a 128-bit key, designed by Wheeler &
Needham [WhNe94] in 1994. It uses a very simple round function composed of alternating
additions and xors along with left and right shifts, iterated over many rounds to provide
sufficient complexity. It has a surprising degree of security for such a simple cipher.


A Current Perspective on Encryption Algorithms 





As these ciphers are newer, there has not been as much opportunity for analysis. Some have already 
been adopted in a number of systems though. 

Most are also characterised by using the same 64 bit data block size as DES, and either 64 or 128 
bit keys (or 40 when lobotomised to insecurity by US export laws). This is changing in the next 
generation being developed for the AES call, as discussed below. 

Comparative Speeds 

As an illustration, a number of these algorithms have been implemented in Java (using the Cryptix 
library [Cryp97]). In one run, interpreted on a Pentium II/266 Linux system, the following 
comparative times were obtained: 

Table of Comparative Block Ciphers Timings 

IJCE Timing    Encryption (1MByte)       Key Init 
Algorithm      Time (ms)   Rate (Kbps)   1000 pairs (ms) 

Blowfish           20506           409             79290 
CAST5              23772           352               976 
DES                48629           172               519 
TripleDES         160807            52              1790 
IDEA               43409           193               734 
LOKI91             31071           269                76 
RC2                43329           193               790 
RC4 (*)            12945           648              2382 
SAFER              41442           202              2219 
Square             29610           283              2166 

nb. (*) RC4 is a stream cipher. 
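
The rate column appears to follow directly from the time column. The sketch below is an illustration only: it assumes that 1 MByte means 2^20 bytes and that rates are truncated to whole Kbps, and under those assumptions it reproduces the tabulated rates. The name rate_kbps is introduced here for illustration and is not part of the Cryptix library.

```python
# Assumption: the 1 MByte test buffer is 2**20 bytes, i.e. 8 * 2**20 bits.
BITS_PER_MBYTE = 8 * 1024 * 1024


def rate_kbps(elapsed_ms: int) -> int:
    """Throughput in kilobits per second, truncated to a whole number.

    Bits per millisecond is numerically the same as kilobits per second.
    """
    return BITS_PER_MBYTE // elapsed_ms


# Encryption times (ms) copied from the table above.
timings_ms = {"Blowfish": 20506, "DES": 48629, "TripleDES": 160807, "RC4": 12945}

for name, elapsed in timings_ms.items():
    print(f"{name:10s} {elapsed:7d} ms  {rate_kbps(elapsed):4d} Kbps")
```

The same calculation can be applied to any other row to check a transcribed figure.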

Choosing a Cipher 

Selecting between the competing ciphers is not easy. Apart from slightly different currently known 
security levels, questions of speed, code size, and any patent or licensing issues need to be 
considered. As a general rule, rely on ciphers that are well known, have been publicly analysed, and 
are in some use. Different applications are likely to require different tradeoffs and hence choices. 

Block Ciphers Future 


Given the rapid advances in hardware speeds, and the development of new cryptanalytic 
techniques, it has become clear firstly that DES must be replaced soon, and secondly that a general 
increase in the block and key sizes used is needed. Hence the AES call. 


Advanced Encryption Standard (AES) 

The US NIST issued a request for possible candidates in Sept 97 [AES97]. As noted (by Bruce 
Schneier) at a meeting discussing the proposed call: 

Miles Smid presented NIST’s goals for AES. They wanted a strong block encryption 
algorithm for government and commercial use, one that would support "standard 
codebook modes" of encryption, "significantly more efficient than Triple-DES," and 
with a variable key size. 


In the call, they note: 

It is intended that the AES will specify an unclassified, publicly disclosed encryption 
algorithm available royalty-free worldwide that is capable of protecting sensitive 
government information well into the next century. 

AES Requirements 

NIST have specified that proposed algorithms must implement a symmetric block cipher, with a 
block size of 128 bits, and key sizes of 128, 192 and 256 bits (at least). They want an algorithm 
whose security is at least as good as Triple-DES, but with significantly improved efficiency. 

Submissions were due by June 15, 1998. These submissions included a complete specification of 
the algorithm, its design rationale, computational efficiency, and expected strength against known 
forms of attack. Also included were reference implementations in Java (based on the Sun JCE 1.2 
framework) and C, as well as extensive validation and test values. 

Following the submission deadline, NIST reviewed all submissions for compliance with the 
minimum requirements, made them publicly available at the First AES conference in 
mid-August 1998, and invited analyses. 

AES Evaluation 

There will then be two phases of public evaluation. During the first phase, NIST will evaluate the 
algorithms, including using publicly submitted evaluations, and select a shortlist. These will then 
be more extensively evaluated during the second phase to select the final candidate algorithm. In 
the evaluations, security is the prime consideration, then efficiency and flexibility. NIST have 
committed to releasing all non-classified evaluations of the candidate algorithms. 

AES Candidates 

NIST have a policy of not naming any proposed candidates until the First AES conference in 
August 1998. At the time of writing, it is known that NIST have accepted 15 algorithms for 
evaluation in phase 1. The following are currently known: 

CAST-256 - Adams (Entrust Tech., Canada) 

A 48-round, unbalanced Feistel cipher using the same round functions as CAST-128, with the 
key schedule also being an unbalanced Feistel cipher, see [Adam98]. 

CRYPTON - Lim (Future Systems, Korea) 

A 12-round self-invertible cipher using byte substitutions, bit permutations and byte 
transpositions, with a block length of 128 bits and key lengths up to 256 bits, see [Lim98]. 

DEAL - Knudsen (U. Bergen, Norway), Outerbridge (UK) 

A rather different proposal, a 6 to 8 round Feistel cipher which uses the existing DES as the 
round function. Thus a lot of existing analysis can be leveraged, but at a cost in speed, see 
[Knud98]. 

DFC - Gilbert, Girault, Hoogvorst, Noilhan, Pornin, Poupard, Stern, Vaudenay (ENS, France) 

An 8-round Feistel cipher designed based on a decorrelation technique, with a 4-round key 
schedule, see [GGHN98]. 

E2 - NTT, Japan 

A 12-round Feistel cipher, using a non-linear function comprised of substitutions, XOR 
mixing operations, and a byte rotation, see [NTT98]. 

FROG - Georgoudis, Leroux, Chaves (TecApro, South Africa) 

An 8-round cipher, with each round performing 4 basic operations (with XOR, byte 
substitution, table value replacement) on each byte of its input, see [GeLC98] 

Hasty Pudding - Schroeppel, Orman (U. Arizona, USA) 

A variable length cipher for any data and key size, using a number of variants all loosely 
derived from the Feistel structure. It uses a number of rounds, which modify 8 internal 64-bit 
variables as well as the data, see [ScOr98]. 

LOKI97 - Brown, Pieprzyk (ADFA/UOW, Australia) 

A 16-round balanced Feistel cipher using a complex function f with two S-P layers, and a 
256-bit key schedule using 48 rounds of an unbalanced Feistel network using the same 
complex function f, see [BrPi98a]. 

MARS - IBM, USA 

A multi-round cipher with 4 distinct phases: key addition and 8 rounds of unkeyed forward 
mixing, 8 rounds of keyed forwards transformation using an unbalanced Feistel structure, 8 
rounds of keyed backwards transformation, and 8 rounds of unkeyed backwards mixing and keyed 
subtraction, see [IBM98]. 

RC6 - Rivest, Robshaw, Sidney, Yin (RSA Labs/MIT, USA) 

A fully parameterised cipher, developed from RC5, which uses a number of 32-bit operations 
(add, sub, xor, left & right rotations) to mix data in each round, see [RBSY98]. 

RIJNDAEL - Daemen, Rijmen (Banksys/Katholieke Universiteit Leuven, Belgium) 

A 10 to 14-round cipher, using byte substitution, row shifting, column mixing and key 
addition, as well as an initial and final round of key addition, derived from the previous 
SQUARE cipher, see [DaRi98]. 

SAFER+ - Massey (Cylink, USA) 

An iterated block cipher, derived from the earlier SAFER, see [Mass98]. 

SERPENT - Anderson (Cambridge UK), Biham (Technion Israel), Knudsen (U. Bergen Norway) 

A 32-round substitution-permutation network cipher, with a key mixing operation, substitutions, 
and a linear transformation in each round, see [AnBK98]. 

TWOFISH - Schneier, Kelsey, Whiting, Wagner, Hall, Ferguson (Counterpane Systems, USA) 

A 16-round Feistel cipher using 4 key dependent S-boxes, matrix transforms and bitwise 
rotations, based in part on Blowfish, see [SKWW98]. 

At this stage it is too early for much analysis to have been done on these algorithms. However you 
can certainly expect to hear about some of them in the coming year or so as the field is narrowed 
down and a final candidate chosen. If you are designing products which use cryptographic 
algorithms, you certainly ought to be planning to incorporate the final AES algorithm in your 
product. This requires planning for the use of 128-bit data blocks, and one or more of 128, 192 or 
256 bit keys. 
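
As a rough illustration of what such planning might involve, the sketch below checks whether a cipher configuration matches the AES block and key sizes, and shows how the number of blocks needed for a message changes with the block size. The class and function names are hypothetical and are not taken from any of the submissions or from the NIST call.

```python
from dataclasses import dataclass

AES_BLOCK_BITS = 128
AES_KEY_BITS = (128, 192, 256)


@dataclass(frozen=True)
class CipherParams:
    """Hypothetical description of a block cipher configuration."""
    name: str
    block_bits: int
    key_bits: int

    def meets_aes_sizes(self) -> bool:
        # True only for a 128-bit block with one of the three required key sizes.
        return self.block_bits == AES_BLOCK_BITS and self.key_bits in AES_KEY_BITS


def blocks_needed(message_bytes: int, block_bits: int) -> int:
    """Number of cipher blocks needed to hold a message (ceiling division)."""
    block_bytes = block_bits // 8
    return (message_bytes + block_bytes - 1) // block_bytes


des = CipherParams("DES", block_bits=64, key_bits=56)
candidate = CipherParams("AES candidate", block_bits=128, key_bits=192)

print(des.meets_aes_sizes(), candidate.meets_aes_sizes())
print(blocks_needed(1000, 64), blocks_needed(1000, 128))
```

A product that hard-codes 8-byte blocks or a single key length in its buffers and file formats would need rework; keeping these as parameters avoids that.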

Comparative Speeds 

At the time of writing, Java implementations of a few of these algorithms were available. The 
timing runs trialled each algorithm with each of the 3 required AES key sizes. In one timing run, 
interpreted on a Pentium II/266 Linux system, the following comparative times were obtained. 

Table of Comparative AES Ciphers Timings 

NIST Timing    Encryption (1MByte)       Key Init 
Algorithm      Time (ms)   Rate (Kbps)   1000 pairs (ms) 

DEAL/128          144567            58              5969 
DEAL/192          144481            58              6025 
DEAL/256          183245            45              7594 
LOKI97/128        159629            52              4104 
LOKI97/192        159118            52              4142 
LOKI97/256        159499            52              4020 
Rijndael/128       30115           278              1738 
Rijndael/192       32649           256              1924 
Rijndael/256       34863           240              2140 
Serpent/128       515656            16              8970 
Serpent/192       516348            16              8955 
Serpent/256       515716            16              8925 
Twofish/128        60977           137              7831 
Twofish/192        61086           137             10280 
Twofish/256        61159           137             12311 


Clearly there are some significant speed differences between even these few algorithms, which 
need to be balanced against possibly varying levels of security, as continuing analysis should 
reveal. 
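
Since NIST asked for something significantly more efficient than Triple-DES, one way to read these figures is relative to the TripleDES time in the earlier block cipher table. The sketch below does that for the 128-bit key runs. It is indicative only: the two tables come from single runs and were produced with different timing frameworks (IJCE and NIST), and the function name is introduced here for illustration.

```python
# TripleDES encryption time (ms per 1 MByte) from the earlier block cipher table.
TRIPLE_DES_MS = 160807

# 128-bit key encryption times (ms per 1 MByte) from the AES timing table.
aes_128_ms = {
    "DEAL/128": 144567,
    "LOKI97/128": 159629,
    "Rijndael/128": 30115,
    "Serpent/128": 515656,
    "Twofish/128": 60977,
}


def speedup_vs_triple_des(elapsed_ms: int) -> float:
    """How many times faster than the TripleDES run (values below 1 are slower)."""
    return TRIPLE_DES_MS / elapsed_ms


# Fastest first.
ranked = sorted(aes_128_ms.items(), key=lambda item: item[1])
for name, elapsed in ranked:
    print(f"{name:13s} {speedup_vs_triple_des(elapsed):5.2f}x")
```

These are early reference implementations run under an interpreter, so the ordering should not be taken as a statement about optimised code.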

Other Observations 


Some personal observations about other aspects of cryptographic algorithms. 

Key escrow is dead! It has been overtaken by events (including the AES), is simply not wanted by 
many around the world, and is bypassed by the widespread availability of good encryption products 
without it. However, key backup is a useful idea, used voluntarily by organisations for good 
management reasons. This will become more common, and supported, in products. It is distinct 
from key escrow in that it applies to archival data and key access, not communications sessions. 

40-bit block cipher keys are manifestly insecure! Repeated demonstrations have illustrated that they 
can be broken by brute-force in less than a few days by any group with access to a reasonable pool 
of workstations (eg. a few labs at a university). More recently, we have seen that even 56-bit DES keys 
are insecure given a group willing to invest a moderate (US$250k) amount in building dedicated 
hardware to search for any key in a few days [EFF98a]. It is clear, as discussed by a group of 
cryptographers in 1996 [BDRS96], that key lengths should be 75 bits or more for moderate term security. 
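
The effect of key length on exhaustive search can be sketched with simple arithmetic, since each extra key bit doubles the work. The search rate used below is a hypothetical round figure, chosen only because it exhausts a 56-bit key space in roughly the "few days" mentioned above; it is not a measured value from [EFF98a].

```python
# Assumption: a hypothetical dedicated machine testing 1e11 keys per second.
ASSUMED_KEYS_PER_SECOND = 1e11
SECONDS_PER_DAY = 86400


def worst_case_search_days(key_bits: int,
                           keys_per_second: float = ASSUMED_KEYS_PER_SECOND) -> float:
    """Days needed to try every key of the given length at the given rate."""
    return (2 ** key_bits) / keys_per_second / SECONDS_PER_DAY


for bits in (40, 56, 75):
    print(f"{bits}-bit key: {worst_case_search_days(bits):.3g} days worst case")
```

On average a key is found after half the worst-case time. Moving from 56 to 75 bits multiplies the work by 2^19, which is the kind of margin the [BDRS96] recommendation is after.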

To support wide-spread interaction amongst large communities, we need (public key) certificate 
authorities, and we need them now! It's time to stop arguing about the politics, and do it! 

Also to improve security, good, strong, encryption is needed now in many products, particularly 
those involving network communications. With the development of new algorithms, and their 
public evaluation, this should be easier than ever before, if only people will address the political 
issues and accept its necessity. 

Conclusions 


This paper presents a brief survey of block ciphers, their past, and their further development in 
response to the AES call. Continuing changes and developments can be expected in the short term 
though the selection of a final AES algorithm should give some direction in the medium term. 


References 





AES97 

NIST, "Advanced Encryption Standard Call", NIST, 1997. 
http://csrc.ncsl.nist.gov/encryption/aes/aes_home.htm. 

Adam98 

Carlisle Adams, "The CAST-256 Encryption Algorithm", Entrust Technologies, Canada, AES 
submission, Jun 1998. http://www.entrust.com/resources/pdf/cast-256.pdf. 

AnBK98 

Ross Anderson, Eli Biham, Lars Knudsen, "SERPENT", Cambridge UK, Technion Israel, U. 
Bergen Norway, AES submission, Jun 1998. 
http://www.cl.cam.ac.uk/ftp/users/rja14/serpent.pdf. 

BDRS96 

M. Blaze, W. Diffie, R. Rivest, B. Schneier, T. Shimomura, E. Thompson, and M. Wiener, 
"Minimal Key Lengths for Symmetric Ciphers to Provide Adequate Commercial Security", 
Jan 1996. http://www.counterpane.com/keylength.html. 

BKPS91 

Lawrence Brown, Matthew Kwan, Josef Pieprzyk, Jennifer Seberry, "Improving Resistance to 
Differential Cryptanalysis and the Redesign of LOKI", in Advances in Cryptology - 
Asiacrypt’91, Lecture Notes in Computer Science, Vol 739, Springer-Verlag, pp 36-50, 1991. 
Also issued as ADFA TR CS38/91. 

BiSh93 

Eli Biham, Adi Shamir, "Differential Cryptanalysis of the Data Encryption Standard", 
Springer-Verlag, New York, 1993. 

Biha94c 

Eli Biham, "New Types of Cryptanalytic Attacks Using Related Keys", Journal of 
Cryptology, Vol 7, No 4, pp 229-246, 1994. 

BrPS90 

Lawrence Brown, Josef Pieprzyk, Jennifer Seberry, "LOKI - A Cryptographic Primitive for 
Authentication and Secrecy Applications", in Advances in Cryptology: Auscrypt ’90, Lecture 
Notes in Computer Science, Vol 453, Springer-Verlag, pp 229-236, 1990. Also issued as 
ADFA TR CS1/90. 

BrPi98a 

Lawrie Brown, Josef Pieprzyk, "Introducing the New LOKI97 Block Cipher", Australian 
Defence Force Academy, Canberra, Australia, AES Submission, Jun 1998. 
http://www.adfa.edu.au/~lpb/research/loki97/ 

Cryp97 

Ian Grigg, Raif Naffah, "Cryptix Java Cryptographic Library", 1997. 
http://www.systemics.com/docs/cryptix/. 

DaRi98 

Joan Daemen, Vincent Rijmen, "AES Proposal: Rijndael", Banksys/Katholieke Universiteit 
Leuven, Belgium, AES submission, Jun 1998. 
http://www.esat.kuleuven.ac.be/~rijmen/rijndael/ 

EFF98a 

Electronic Frontier Foundation, "Cracking DES", O’Reilly, 1998. 1-56592-520-3. 
http://www.oreilly.com/catalog/crackdes/. 

GGHN98 

Henri Gilbert, Marc Girault, Philippe Hoogvorst, Fabrice Noilhan, Thomas Pornin, Guillaume 
Poupard, Jacques Stern, Serge Vaudenay, "Decorrelated Fast Cipher: an AES Candidate", 
Ecole Normale Superieure, France, AES submission, Jun 1998. 
http://www.dmi.ens.fr/~vaudenay/dfc.html 

GeLC98 

Dianelos Georgoudis, Damian Leroux, Billy Simon Chaves, "The "FROG" Encryption 
Algorithm", TecApro Intl., South Africa, AES submission, Jun 1998. 
http://www.tecapro.com/aesfrog.htm. 

IBM98 

Carolynn Burwick, Don Coppersmith, Edward D’Avignon, Rosario Gennaro, Shai Halevi, 
Charanjit Jutla, Stephen M. Matyas Jr., Luke O’Connor, Mohammad Peyravian, David 
Safford, Nevenko Zunic, "MARS - a candidate cipher for AES", IBM, USA, AES 
submission, Jun 1998. http://www.research.ibm.com/security/mars.html. 

Knud98 

Lars Knudsen, "DEAL: A 128-bit Block Cipher", University of Bergen, Norway, Technical 
Report, No 151, Feb 1998. http://www.ii.uib.no/~larsr/newblock.html 

Lim98 

Chae Hoon Lim, "CRYPTON : A new 128-bit block cipher", Future Systems, Korea, AES 
submission, Jun 1998. http://crypt.future.co.kr/~chlim/crypton.html 

Mass98 

James Massey, "SAFER+", Cylink, USA, Jun 1998. http://www.cylink.com/SAFER. 

Mats93 

Mitsuru Matsui, "Linear Cryptanalysis Method for DES Cipher", in Advances in Cryptology - 
Eurocrypt’93, Lecture Notes in Computer Science, Vol 765, Springer-Verlag, pp 386-397, 
1993. 

NTT98 

Kazumaro Aoki, Masayuki Kanda, Tsutomu Matsumoto, Shiho Moriai, Kazuo Ohta, Miyako 
Ookubo, Youichi Takashima, Hiroki Ueda, "E2 - a 128-bit Block Cipher", NTT, Japan, AES 
submission, Jun 1998. http://info.isl.ntt.co.jp/e2/. 

RBSY98 

R. Rivest, M.J.B. Robshaw, R. Sidney, Y.L. Yin, "The RC6 Block Cipher", RSA Labs/MIT, 
USA, AES submission, Jun 1998. http://theory.lcs.mit.edu/~rivest/rc6.pdf 

RSAD97 

RSA Data Security Inc, "Government encryption standard DES takes a fall", 1997. 
http://www.rsa.com/des/. 

SKWW98 

Bruce Schneier, John Kelsey, Doug Whiting, David Wagner, Chris Hall, Niels Ferguson, 
"Twofish: A 128-bit Block Cipher", Counterpane Systems, USA, AES submission, Jun 1998. 
http://www.counterpane.com/twofish.html. 

ScOr98 

Rich Schroeppel, Hilarie Orman, "An Overview of the Hasty Pudding Cipher", University of 
Arizona, Arizona, USA, AES submission, Jun 1998. http://www.cs.arizona.edu/~rcs/hpc/ 

Schn96 

Bruce Schneier, "Applied Cryptography - Protocols, Algorithms and Source Code in C", 2nd 
edn, John Wiley and Sons, New York, 1996. 

Stal98 

W. Stallings, "Cryptography and Network Security - Principles and Practice", 2nd edn, 
Prentice-Hall, 1998. http://www.shore.net/~ws/Security.html 

WhNe94 

David J. Wheeler, Roger M. Needham, "TEA, a Tiny Encryption Algorithm", in Fast 
Software Encryption 2, Lecture Notes in Computer Science, Vol 1008, Springer-Verlag, pp 
363-366, 1994. http://www.cl.cam.ac.uk/ftp/papers/djw-rmn/djw-rmn-tea.html. 


The latest version of this paper may be found at: http://www.adfa.edu.au/~lpb/papers/auu298/ 
Last updated: 31 Jul 98. 





Which One is for You : 


Issues of Trust in Digital Signature Certificates 


Yinan Yang 

National Library of Australia 
Email: yyang@nla.gov.au 
Dr. Lawrie Brown 
School of Computer Science, 
Australian Defence Force Academy, UNSW 
Email: Lawrie.Brown@adfa.oz.au 
Dr. Jan Newmarch 
Information Sciences and Engineering 
University of Canberra 
Email: jan@ise.canberra.edu.au 


Abstract 


Trust is an increasingly important concept on the Internet, especially for Electronic Commerce. 
There are a number of trust-models on the Internet providing authentication which attempt to 
achieve the maximum of trust with minimum of risks. These include: X.509 standard Public Key 
Infrastructure (PKI), other PKI such as Pretty Good Privacy (PGP), the Simple Public Key 
Infrastructure (SPKI) and a Simple Distributed Secure Infrastructure (SDSI). These models use 
public key encryption techniques, certificates, and digital signatures. A certificate is used as a 
trust-token between different parties on the Internet to tell others you are really who you say you 
are. This work is a survey of different existing trust models, their certificates, their structures, 
ways of handling transitivity of trust and Certificate Revocation Lists (CRL). 


1. Introduction 

Doing business on the Internet poses new service relationships among parties. The 
infrastructure of doing business is changing: the conventional service-relationship among 
parties and the usual verification of legitimate business parties are changing. Among all 
parties on the Internet, the trust relationship is under threat from spoofing, eavesdropping, 
modification, masquerading and insecure transmission. The key to the trust relationship 


Which One is for You : Issues of Trust with Public Key Certificates 





among business parties (or entities) is to be able to verify the true identity of each other. 
Authenticity is an important factor of trust on the Internet. 

Trust has been a focus among professionals and their Internet communities. Governments 
and individuals (policy makers, cryptographers, economists and users) have pooled their 
resources to define and develop a safe environment on the Internet. Public Key Infrastructure 
(PKI) provides an authentication framework, ie. a way of authenticating parties to establish a 
trust relationship. It also includes managing public keys and registrations using digitally 
signed certificates. Digital signature certificates are becoming an increasingly important 
technical mechanism for trust management on the Internet. 

2. Background 

About 100 companies and organisations in the world (including Australia’s government 
public key authority [3]) provide certification services and certification products based on 
various trust models. These different Public Key Infrastructure (PKI) trust models include the 
X.509 standard PKI trust model, Pretty Good Privacy (PGP), Simple Distributed Secure 
Infrastructure (SDSI) and Simple Public Key Infrastructure (SPKI). 

The term certificate was first proposed in 1978 by Loren M. Kohnfelder in his 
bachelor's thesis at MIT [1]. Since then, the certificate has been used as a binding between a 
name and a public key. The binding is made by a digital signature after certifying the user’s 
public key. This certificate must be signed by a certificate authority (CA) or a person in 
whom the certificate user has some degree of trust. The CA and the person are also known as 
the "trusted third party" (TTP). The certificate user verifies certificates by verifying TTP 
digital signatures (which can only have been created using the TTP’s private key). If the 
signatures are "good", the certificate user has confidence that the public key of the entity in 
the certificate belongs to the entity who claims it, with the degree of confidence depending on 
how greatly the CA is trusted to have verified this relationship. 
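
The check described above can be sketched with a toy signature scheme. The example below uses textbook RSA with a deliberately tiny key so that the arithmetic is visible; the key values, the function names and the example binding are all assumptions made for illustration, and nothing this small is secure or representative of a real CA.

```python
import hashlib

# Toy TTP key pair (assumption: tiny textbook RSA numbers, for illustration only).
P, Q = 61, 53
N = P * Q          # public modulus, 3233
E = 17             # public exponent
D = 2753           # private exponent, since 17 * 2753 = 1 mod 3120


def digest(data: bytes) -> int:
    """Hash the certificate contents and reduce the result into the toy modulus."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % N


def ttp_sign(data: bytes) -> int:
    """Only the holder of the private exponent D can produce this value."""
    return pow(digest(data), D, N)


def verify(data: bytes, signature: int) -> bool:
    """Anyone holding the TTP public key (N, E) can run this check."""
    return pow(signature, E, N) == digest(data)


binding = b"name=alice@example.org;public-key=ALICE-KEY"   # hypothetical contents
signature = ttp_sign(binding)

print(verify(binding, signature))              # a genuine signature is accepted
print(verify(binding, (signature + 1) % N))    # an altered signature is rejected
```

A successful check shows only that the TTP signed this binding; how much that is worth still depends on how carefully the TTP verified the name before signing.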

A certificate contains various bits of information: 

• information about an entity: email address, public key; 

• information about a certificate issuer: issuer-signature (created with the issuer’s private key), 
issuer-public-key; 

• information about the certificate itself: date and time of certificate issue, 
certificate-serial-number, validity-period, certificate-fingerprint; 

• extra information: certificate version, CRL version, class and type of the certificate and 
comments. 
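
The items listed above can be pictured as a simple record. The sketch below is not an X.509 or PGP encoding; the field names and example values are hypothetical, chosen to mirror the list, and a validity-period check is included as one example of how a certificate user might use these fields.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Certificate:
    """Hypothetical record mirroring the information listed above."""
    # information about the entity
    subject_email: str
    subject_public_key: str
    # information about the issuer
    issuer_public_key: str
    issuer_signature: str
    # information about the certificate itself
    issued_at: datetime
    serial_number: int
    validity_period: timedelta
    fingerprint: str
    # extra information
    version: int = 3
    certificate_class: int = 1
    comments: str = ""

    def is_valid_at(self, when: datetime) -> bool:
        """True if 'when' falls inside the validity period."""
        return self.issued_at <= when < self.issued_at + self.validity_period


cert = Certificate(
    subject_email="alice@example.org",
    subject_public_key="ALICE-PUBLIC-KEY",
    issuer_public_key="CA-PUBLIC-KEY",
    issuer_signature="CA-SIGNATURE",
    issued_at=datetime(1998, 7, 1),
    serial_number=1001,
    validity_period=timedelta(days=365),
    fingerprint="AB:CD:EF",
)
print(cert.is_valid_at(datetime(1998, 12, 1)))
```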


There are different types of certificate, such as identity certificates, credential certificates and 
attribute certificates (ANSI draft standard X.9.45) [8]. A variety of applications [17] use 
certificates to protect personal email (eg. Personal ID), transacting gateways, web sites, 
securing connection (eg. server ID) and authenticity of publications on the Internet (eg. 
Developers ID). There are also different classes of certificates providing different assurance 
levels. Some examples of different certificates include personal certificates, server (or site) 
certificates and developer certificates [16]. 


In general, the certificate provides a statement by a trusted third party (eg. Issuer Authority or 
Certificate Authority or a person) that the signed key is authentic and truly belongs to the 
subject (ie. person, organisation, site, server) who claims it. 

For example, if Alice wants to send a message to Bob using Communicator’s Messenger: 

1) Alice applies for a (class 1) certificate from a CA by filling in a form 
that includes her email address; 

2) A key pair is generated as part of the process of applying for the certificate (the private key 
can be protected by a password); 

3) The CA verifies Alice’s ID (ie. email) by sending a PIN to her email address; 

4) After Alice receives the email with the PIN, she retrieves her certificate from the CA using 
the PIN. The certificate now resides in Alice’s Communicator and Messenger applications 
and is ready for use; 

Meanwhile, Bob performs a similar process to get his certificate. 

5) Alice signs her message with her certificate and sends it to Bob; 

6) After Bob receives Alice’s signed message, he can reply to her with a message signed with 
his certificate; 

7) After exchanging certificates, Bob and Alice can send signed and encrypted (S/MIME) 
emails to each other. 

Alice and Bob have full confidence that their email is truly coming from the expected source. 
This confidence is vouched for by the CA that signed their certificates. 

In the above example, Alice and Bob obtained their certificates from the same CA. What 
if they were using different CAs? How would they verify each other’s certificates? 

Different trust models have demonstrated both conceptual and implementation 
commonalities and differences in the areas of authentication framework structure, transitivity 
of trust, certificate revocation lists (CRL), and certificate policy (including certificate practice 
statement). Different trust models have a different focus on the trust relationship, carry 
different information in a certificate, and provide different levels of trust. Their organisational 
and technical implementation also varies. 

The following table summarises a number of differences in trust management among different 
trust models: 


Models: X.509, PGP, SDSI & SPKI 

Certificate type 
    X.509:        1) DN based cert; 2) Credential based cert (v.3) 
    PGP:          Email-address-based cert 
    SDSI & SPKI:  1) Id cert; 2) group-membership cert; 3) local-name-binding cert 
                  (both ID and credential-based) 

Capability 
    X.509:        Authentication; non-repudiation 
    PGP:          Authentication; privacy 
    SDSI & SPKI:  Authentication; Some Privacy; Authorisation (credential: ground 
                  certain operations) 

Structure of trust 
    X.509:        Hierarchy (with Web capability) 
    PGP:          Web only 
    SDSI & SPKI:  Web (via Groups; Sets; Principals) 

Level/Class of trust 
    X.509:        Classes (eg. Class 1, Class 2, Class 3) 
    PGP:          1=don’t know; 2=No; 3=Usually; 4=Yes 
    SDSI & SPKI:  egalitarian; co-sign by several principals 

Transitivity of trust 
    X.509:        via certification path: Hierarchy; Network 
    PGP:          1=don’t know ==> undefined; 2=No ==> untrusted; 
                  3=Usually ==> marginal; 4=Yes ==> complete; 
                  and combining rules (eg. 3+3=4) 
    SDSI & SPKI:  delegation certificate; groups: set of trusted membership 

Certificate verification 
    X.509:        1) Hierarchy path: unidirectional; 
                  2) Network path: bi-directional (CrossCertificatePair) 
    PGP:          checking categories: 1=undefined; 2=untrusted; 3=marginal; 
                  4=complete and ultimate (own key) 
    SDSI & SPKI:  searching membership in a trusted group 

CRL 
    X.509:        periodically updating CRL (eg. daily for class 2, 3 in chap 9 of CPS) 
    PGP:          word of mouth (sending a note to others) 
    SDSI & SPKI:  expiration dates; revalidation; periodic re-confirmation 

(Table 1) 
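
The PGP combining rule mentioned in the table (eg. 3+3=4) can be illustrated as follows. This is a minimal sketch under stated assumptions: one completely trusted introducer, or two marginally trusted ones, is taken as enough to treat a key as completely valid. The threshold of two comes only from the "3+3=4" example, and the function is hypothetical rather than PGP's actual implementation.

```python
from typing import Sequence

# Trust categories as numbered in Table 1.
UNDEFINED, UNTRUSTED, MARGINAL, COMPLETE = 1, 2, 3, 4


def key_validity(introducer_trust: Sequence[int], marginals_needed: int = 2) -> int:
    """Combine the trust placed in each introducer who signed a key.

    Assumed rule: any completely trusted signer, or enough marginally
    trusted signers, makes the key completely valid.
    """
    if COMPLETE in introducer_trust:
        return COMPLETE
    marginal_count = sum(1 for level in introducer_trust if level == MARGINAL)
    if marginal_count >= marginals_needed:
        return COMPLETE
    if marginal_count > 0:
        return MARGINAL
    return UNDEFINED


print(key_validity([MARGINAL, MARGINAL]))   # two marginal signers combine (3+3=4)
print(key_validity([MARGINAL]))             # one marginal signer is not enough
```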


3. X.509 PKI Trust Model 

Public key cryptography provides secure communications over the Internet without prior exchange of 
encryption/decryption keys, and has been recommended for use with the 
ISO authentication framework. This framework, also known as the X.509 protocol, provides 
for authentication across networks. 

The most important part of X.509 is its structure for public-key certificates, which are 
identity-based certificates. The user’s public key is bound to the user’s distinguished name 
(ie. a globally unique name space). This certificate must be signed and issued to a particular 
entity by a certificate authority (CA or Issuing Authority) under the provisions of its certificate 
policies and Certificate Practice Statement (CPS). 

3.1 X.509 PKI Authentication Structures 

In X.509 versions 1 and 2, the X.509 PKI model was primarily a hierarchical structure. 
X.509 version 3 introduced some extensions allowing the management of policies and trust 
relationships in a web or networked structure [9]. An X.509 trust model has a formal and 
complex organisational structure. It usually consists of the policy approving authority (PAA), 
the policy certificate authority (PCA), the certificate authority (CA), the organisational 
registration authorities (ORA) and the trusted third party (TTP). 



• PAA sets the overall guidelines for the entire PKI. It may also certify public keys belonging 
to a PCA. 

• PCA establishes policies for all certification authorities and users within its domain. It 
certifies a CA’s public keys. 

• CA has minimal policy making responsibilities. The CAs certify the public keys of users in 
alignment with PCA’s and PAA’s policies. 

• ORA is an entity which sits between a CA and a user, to vouch for the identity and affiliation 
of the user and register that user with its CA. 

• TTPs provide an emergency encrypted data access and recovery service. TTPs store a copy of 
management private keys and release them to authorised parties (for escrow) in an 
emergency. 


A PKI functional structure consists of certificate issuance, certificate archiving, certificate 
revocation and policy approval. It also involves a number of security servers and agents, 
including the certified delivery server, the digital notary server, the ticket granting agent and 
the TTP. They perform tasks such as confidentiality key exchange, key pair generation, key 
management, digital signature verification and digital signature generation to provide 
authentication, integrity and non-repudiation [10]. 


3.2 X.509 PKI’s Transitivity of trust 


The trust in the X.509 compliant PKI hierarchical-trust model is transferred along a set of 
certificates that provides a chain of trust. The chain of certificates can be of arbitrary length 
[5]. To establish trust between two entities, the verifier verifies 
the signatures on a chain of certificates until reaching a certificate which the verifier trusts. 
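
The walk along a chain of certificates can be sketched as below. To keep the example short, the signature check at each step is assumed to have succeeded and only the issuer and subject links are examined; the Cert record and the names used are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Set


@dataclass(frozen=True)
class Cert:
    subject: str   # whose public key this certificate carries
    issuer: str    # who signed it


def chain_reaches_trust(chain: Sequence[Cert], trusted: Set[str]) -> bool:
    """Follow the chain until a certificate signed by a trusted party is found.

    chain[0] is the certificate being checked; each later certificate must
    certify the issuer of the one before it. Signature verification at each
    step is assumed to happen elsewhere.
    """
    for position, cert in enumerate(chain):
        if cert.issuer in trusted:
            return True
        has_next = position + 1 < len(chain)
        if not has_next or chain[position + 1].subject != cert.issuer:
            return False   # the chain is broken or runs out before trust is reached
    return False


good_chain = [Cert("Bob", "CA2"), Cert("CA2", "Root")]
broken_chain = [Cert("Bob", "CA2"), Cert("CA3", "Root")]

print(chain_reaches_trust(good_chain, {"Root"}))
print(chain_reaches_trust(broken_chain, {"Root"}))
```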


A list of certificates needed to allow a particular user to obtain the public key of another, is 
known as a "certification path". Each CA has a list of certificates. Some of the certificates are 
named as "forward certificates" which are issued by other CAs (eg. a superior CA in the 
hierarchy). Some of the certificates are named as "reverse certificates" which are issued by 
the CA itself. This list of certificates enables two users to mutually authenticate by 
constructing a certification path which forms an unbroken chain of trusted nodes in a 
hierarchy structure. 






(Figure 2 [5]) 

"A<X>,X<W>" means A signed the certificate of X, and X signed the certificate of W, where 
W, X, A are certificate authorities. It is important to know that the level of trust provided by 
certificates "A<X>,X<W>" is the same as "A<W>". In other words, possession of 
"A<X>,X<W>" provides the same capability as "A<W>" (ie. A<X>,X<W> = A<W>), 
namely the ability to find out the public key of W (ie. a CA) given the public key of A (ie. a 
CA). 
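
The composition rule can be written out as a small function. In this sketch each certificate is a pair (holder of the known key, holder of the key it reveals), so a chain that starts from the public key of A and ends at W collapses to the single pair (A, W). The function name is introduced here for illustration.

```python
from typing import List, Tuple

Link = Tuple[str, str]   # (whose key is known, whose key the certificate reveals)


def compose(path: List[Link]) -> Link:
    """Collapse a certification path into the single equivalent link.

    Raises ValueError if consecutive certificates do not join up.
    """
    if not path:
        raise ValueError("empty certification path")
    for (_, revealed), (next_known, _) in zip(path, path[1:]):
        if revealed != next_known:
            raise ValueError(f"chain broken between {revealed} and {next_known}")
    return (path[0][0], path[-1][1])


# A<X>, X<W> gives the same capability as A<W>.
print(compose([("A", "X"), ("X", "W")]))
```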


The certification path (chains of certificates) can be constructed either in a hierarchical or a 
network way. 


In a hierarchical certification path, a "root" CA issues certificates to its subordinate CAs and 
these CAs issue certificates to their subordinate CAs and so on. The "root" CA is regarded as 
the most trustworthy CA in the hierarchical certification path and every user must know the 
root CA’s public key. Any user’s certificate may be verified by following the certification 
path until the verifier reaches the root CA (eg. a common trusted CA or the CA who issued 
the user’s certificate). 
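
A hierarchical path of this kind can be sketched with a simple table recording who issued each party's certificate. The hierarchy below is hypothetical; the two functions show the path from a user up to the root, and the nearest CA that two users have in common.

```python
from typing import Dict, List, Optional

# Hypothetical hierarchy: each entry maps a party to the CA that issued its certificate.
issuer_of: Dict[str, str] = {
    "Alice": "CA1",
    "Carol": "CA1",
    "Bob": "CA2",
    "CA1": "Root",
    "CA2": "Root",
}


def path_to_root(issuers: Dict[str, str], entity: str) -> List[str]:
    """Certification path from an entity up to the root CA."""
    path = [entity]
    while path[-1] in issuers:
        path.append(issuers[path[-1]])
    return path


def common_trust_point(issuers: Dict[str, str], a: str, b: str) -> Optional[str]:
    """Nearest CA that appears on both parties' paths to the root."""
    ancestors_of_a = path_to_root(issuers, a)[1:]
    for node in path_to_root(issuers, b)[1:]:
        if node in ancestors_of_a:
            return node
    return None


print(path_to_root(issuer_of, "Alice"))
print(common_trust_point(issuer_of, "Alice", "Bob"))
print(common_trust_point(issuer_of, "Alice", "Carol"))
```

Users under the same CA need go no further than that CA, while users under different CAs must go up to the root.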


The X.509 compliant PKI hierarchical-trust model has some advantages and disadvantages. 


Advantages : 

• The hierarchical-trust model is aligned with the structure of large organisations where
security policies and trust relationships are managed in a tree structure (eg.
government departments);

• Trust relationships are frequently aligned with the organisational structure, such as in a
government department or commercial organisation, so it is natural to adapt the
certification path to the existing organisational hierarchical-trust structure;

• The hierarchy represents a directory name structure (aligned with X.500);

• The certification path search strategy is straightforward, since it uses existing
organisational and directory structures;

• Users in a hierarchy need only obtain the certification path from the
common point of trust, because they already know the public keys and certificates
between themselves and the root CA.

Disadvantages : 

• The trust relationships of many commercial organisations and companies are not
hierarchical structures;

• Propagating a new root public key on a large scale is very costly in time and money.


In a network certification path, CAs use CrossCertificates to cross-certify each other. In other 
words, each CA issues the other a certificate and combines the two certificates in its directory. 



(Figure 3) 

The CrossCertificate supports chains of trust in both directions (unlike a hierarchical
certification path, which is unidirectional). To verify another user’s certificate, the user
verifies a certification path that leads back to its own trusted CA, whose public key the
user already knows.

Advantages : 

• The CrossCertificate provides more flexible bilateral trust relationships, which benefit most
business organisations (eg. CAs whose users communicate frequently can store
CrossCertificates in each other’s directories);

• The CrossCertificate reduces the amount of information needed for some instances of authentication;

• Once two frequently communicating users have learned each other’s certificates, they can use
each other’s public key directly.

Disadvantages : 

• There are more alternative certification paths, which can make search strategies complex, ie.
a single certification path is hard to obtain.


Returning to the earlier question: "If Alice and Bob obtained their certificates from different CAs,
how do they verify each other’s certificates?":


1) According to the X.509 PKI trust model, Alice (where A is) only needs to obtain X<Z>, 
Z<B> to form the certification path, and Z<X> to form the return certification path. Bob 
(where B is) only needs to obtain Z<X>, X<A> to form the certification path, and X<Z> to 
form the return certification path (refer to Figure 2).

2) If A is Alice and C is Bob, both know X’s public key, so Alice (where A is) directly 
acquires the certificate of Bob (where C is). Alice only needs X<C> to form the certification 
path and X<A> to form the return certification path. Bob (where C is) directly acquires the 
certificate of Alice (where A is) and obtains X<A> to form the certification path and X<C> to 
form the return certification path. 
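
The two cases above can be reproduced with a small path search. The sketch below assumes only the certificates named in the text (X<A>, X<C>, X<Z>, Z<X>, Z<B>) and treats a certificate issuer<subject> as a directed edge; the function name find_path is hypothetical and signatures are not verified.

```python
from collections import deque

# Certificates named in the text, written as (issuer, subject) for issuer<subject>.
CERTS = [("X", "A"), ("X", "C"), ("X", "Z"), ("Z", "X"), ("Z", "B")]


def find_path(trusted_ca, target, certs=CERTS):
    """Breadth-first search for the shortest chain of certificates from trusted_ca to target."""
    issued = {}
    for issuer, subject in certs:
        issued.setdefault(issuer, []).append(subject)
    queue = deque([(trusted_ca, [])])
    visited = {trusted_ca}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for subject in issued.get(node, []):
            if subject not in visited:
                visited.add(subject)
                queue.append((subject, path + [f"{node}<{subject}>"]))
    return None  # no certification path exists
```

Alice, who trusts X, obtains find_path("X", "B"), and Bob, who trusts Z, obtains find_path("Z", "A") as the return path. With more cross-certificates the search space grows, which is the complexity noted under the disadvantages of the network model.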


3.3 CRLs in X.509 PKI 


A CRL (certificate revocation list) is a list of revoked certificates and must be signed by a
valid CA. When a private key is compromised, or the person who holds the certificate leaves
the issuing organisation, it is time to revoke his/her certificate. A CRL records the time
when the CRL was issued, the time the next CRL will be issued, and the serial numbers of the
revoked certificates, which are unique for the CRL issuer. Verifiers therefore periodically
retrieve CRLs from repositories. Users can check a certificate against CRLs for its validity. In
X.509 version 3, some extensions have been proposed to improve the effectiveness of CRLs
(eg. time to live) [5].
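
A CRL check of this kind might look like the following sketch. The CRL class and its field names are hypothetical stand-ins for the three items listed above (issue time, next issue time, revoked serial numbers); verification of the CA's signature on the CRL is assumed to have been done elsewhere.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class CRL:
    issuer: str
    this_update: datetime          # when this CRL was issued
    next_update: datetime          # when the next CRL is due
    revoked_serials: set = field(default_factory=set)


def certificate_status(issuer, serial, crl, now):
    """Classify a certificate against a CRL: 'revoked', 'good' or 'unknown'."""
    if crl.issuer != issuer:
        return "unknown"   # serial numbers are only unique per CRL issuer
    if serial in crl.revoked_serials:
        return "revoked"
    if now >= crl.next_update:
        return "unknown"   # the CRL is stale; a fresher one should be retrieved
    return "good"
```

Returning "unknown" for a stale CRL reflects the periodic retrieval described above: absence from an out-of-date list says nothing about the certificate's current state.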


In summary, the X.509 (early versions and version 3) compliant trust model has wide
acceptance in industry and many products are currently on the market [19]. Compared to
other trust models, it has a well documented PKI in both organisational and functional
aspects, which is important for inter-operability among certification products, although some
may argue that the model is trying to formalise too many things.


4. PGP Trust Model 


Pretty Good Privacy (PGP) uses the web of trust model. The web of trust model permits each
user to certify a binding between a particular public key and a particular user-ID. The
user-ID in PGP consists of a name and email address. The email address provides a global
name space, which is similar to a Distinguished Name in X.509.


A PGP key certificate contains: the public key; one or more user IDs for the key’s
creator; the date that the key was created; and the list of digital signatures on the key
(optionally provided by people who attest to the key’s veracity) [13].


4.1 PGP’s Transitivity of Trust 


In PGP, the trust is transferred through each person’s own belief of whether the other person 
is trustworthy or not. 

A measure of how much you believe the honesty and judgement of the person 
who created the key. The more you trust a key, the more you trust the person who 
created the key to certify other people’s keys. — page 235 of PGP [13] 


In PGP, a web of trust is constructed by establishing a list of validated public keys on a
user’s key ring. Each key certificate on a PGP public key ring has validity and trust parameters. The
validity is an indication of whether you believe the key belongs to the person (eg. complete).
The trust is a measure of how much you believe the person who created the key (eg. marginal,
complete). The more you trust the person, the more you may trust him/her to certify other people’s
keys (refer to Table 1).

Advantages : 

• Easy to down-grade the level of trust: one may decide to change the level of trust in the 
person, for any reason (eg. other person indiscriminately signs any key; the person’s 
judgement appears suspect); 

• Easy to upgrade the level of trust: there may be a number of reasons (deeper understanding of 
the person; means of signing a key; sudden return from long leave of a trusted person, 
technological advances); 

• Self-control: PGP allows you to change the level of trust at your fingertips; there is no
formal requirement to justify your actions;

• Easy to obtain: PGP can be easily obtained from the Internet and can be used with a number of
mail client applications, such as Pine and Eudora.

Disadvantages: 

• One of the limitations of PGP is that its web-of-trust structure restricts the expansion of trust (ie.
scalable trust). However, with some organisations willing to vouch for PGP users’ public keys, it
allows a certain degree of scalable trust (eg. AUUG provides a CA service for PGP keys for
its members);

• PGP users must sign their own keys (or have them signed by a CA) to avoid the possibility of 
someone altering the true identity on the public key. This step is not part of the automatic 
installation process. Some degree of PGP technical knowledge is required (ie. lack of user 
transparency) to achieve maximum security. 


For example, if Alice knows Carl via Bob’s signature, how much trust can Alice place in
Carl?


Alice can make her own decision about the level of trust in Carl at the time, in the range of
"undefined, untrusted, marginal and complete", depending on her trust in Bob at the time.

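How Alice's trust in Bob feeds into the validity of Carl's key can be sketched as below. This is a simplification under stated assumptions: the thresholds (one completely trusted or two marginally trusted introducers) follow PGP's conventional defaults, the function name key_is_valid is hypothetical, and the signers' own keys are assumed to be valid already.

```python
# Assumed thresholds, matching PGP's conventional defaults.
COMPLETES_NEEDED = 1
MARGINALS_NEEDED = 2


def key_is_valid(signers, trust_in):
    """Decide whether a key is valid given who signed it and how far each signer is trusted.

    signers  -- names of the introducers who signed the key
    trust_in -- maps an introducer's name to 'undefined', 'untrusted', 'marginal' or 'complete'
    """
    completes = sum(1 for s in signers if trust_in.get(s, "undefined") == "complete")
    marginals = sum(1 for s in signers if trust_in.get(s, "undefined") == "marginal")
    return completes >= COMPLETES_NEEDED or marginals >= MARGINALS_NEEDED
```

If Alice trusts Bob completely, key_is_valid(["Bob"], {"Bob": "complete"}) accepts Carl's key; if she trusts Bob only marginally, a second marginally trusted signature is needed under these assumed thresholds.
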

4.2 CRLs in PGP 


In PGP, there is no central repository of certificate revocation lists (CRLs). Every PGP user
can revoke his/her own public key and disable others’ public keys, which is unlike the X.509 compliant
PKI’s revocation mechanism. When a person wants to notify others that his/her public key
has been compromised, the key’s owner can issue a key revocation certificate (PGP -kd
myself), extract it from the public key ring and send it to his/her correspondents.


This autonomous revocation-notification process may not be suitable for a large circle or domain,
given the time delays in propagation. However, it may work very well in a small circle. When a
user is suspicious of someone’s public key, the user can disable it on his/her own public key
ring (PGP -kd someone) at any time. PGP allows users to change the level of trust in a
public key at their fingertips.


In summary, PGP may not be as convenient as a centralised registry of public keys, such as 
the X.509 PKI trust model. However the key verification process can be distributed among 
peers. This web-of-trust structure allows multiple signatures to vote a binding into validity,
which may be more difficult to subvert [7]; thus it provides some degree of trust.


5. SDSI Trust Model 


In 1996, Ronald L. Rivest and Butler Lampson proposed the Simple Distributed Security
Infrastructure (SDSI) [12]. Later, the Simple Public-Key Infrastructure (SPKI) (refer to section
6) work group joined forces with the SDSI group, forming a joint effort to simplify the
"excessively complex" X.509-based global certificate hierarchies. SDSI defines a certificate
structure and operating procedures for trust management. It supports both web and
hierarchical trust models. It holds that simple and clear data structures can enhance the
security of a system. Its focus is on how to define a flexible data structure to
perform trust management tasks in a flexible way. It claims that the SDSI data structure makes it
easier to define and implement access control lists (ACLs) and security policies, which is an
authorisation function of security services.


In SDSI, there are three types of certificates: membership certificates, identity certificates and
name/value certificates. Certificates are defined as signed objects (eg. a principal, a group, a
quote). SDSI gives the user the flexibility to define any objects in its data structure, eg.
"Signed messages are a special case of certificates" [12]. Identity-based certificates bind the
public key and the individual (ie. person, principal). Group-membership certificates assert
that a principal or principals are members of a group. Name/value certificates use a local
name (or a symbol), which is free-form text information about the person.


SDSI introduced concepts of objects, such as principals, groups (ie. a set of principals forms a
group, and a group can be within a group), group-membership certificates and local name spaces.

A notion of a group is used to specify authorisation by its principals. Each group has a name 
and a set of members. The name is local to the principal, who is the owner of the group. 


Principals have various "special status" which give them different authorities to perform 
certain tasks. Each principal can sign statements and requests on the same basis as any other 
principal. Each principal (ie. a public key, a digital signature verification key) is a 
"certification authority". S/He can issue and sign a certificate based on his/her own judgement 
or policies. The principal can also be the "value" of some name and can create its own local 
names (eg. nicknames, email addresses, account numbers, or any information) with which 
he/she can refer to other principals. The owner (ie. the principal) can change the group 
definition. The definition explicitly lists the group’s members, or defines the group in terms 
of other groups. Groups can be defined within a group (ie. nested expressions). A group has
no public key associated with it. A member of the group (eg. a principal) can use its
membership certificate to make a request for his/her (or its) group.
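
Nested group definitions of this kind can be resolved recursively. The sketch below is an illustration only: the group names, the key identifiers and the dictionary layout are hypothetical and do not reproduce SDSI's actual syntax.

```python
# Hypothetical group definitions owned by one principal. A definition lists
# principals directly or refers to other groups by name (nested groups).
GROUPS = {
    "staff": {"principals": {"key-alice", "key-bob"}, "groups": set()},
    "contractors": {"principals": {"key-carl"}, "groups": set()},
    "project": {"principals": {"key-dana"}, "groups": {"staff", "contractors"}},
}


def members(group, groups=GROUPS, _visiting=None):
    """Return every principal in a group, expanding nested groups recursively."""
    visiting = set() if _visiting is None else _visiting
    if group in visiting:
        return set()            # already expanded (or a definition cycle): add nothing more
    visiting.add(group)
    definition = groups[group]
    result = set(definition["principals"])
    for nested in definition["groups"]:
        result |= members(nested, groups, visiting)
    return result


def is_member(principal, group, groups=GROUPS):
    """True if the principal belongs to the group directly or through a nested group."""
    return principal in members(group, groups)
```

Because the owner can redefine a group at any time, membership here is computed on demand from the current definitions rather than stored.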


5.1 SDSI’s Transitivity of Trust 


SDSI is a web of trust (egalitarian). The trust in SDSI can be passed using principals as the
"delegate authority". Delegate authority uses groups and a delegation certificate. A delegation
certificate gives the delegatee the authority to sign certain types of statements on behalf of the
delegator. It can also be used to authorise a group. A group with a delegation certificate will
be able to sign on behalf of the signing principal. The delegation certificates may be used as
credentials by the group to show that their signatures are authorised by the principal. Several
signers can co-sign SDSI objects (similar to PGP key signatures).

Advantages: 


• Eases change and defines credentials for groups and principals by privileged principals; 

• Free-form text on an individual, which provides some privacy for the individual;

• Delegate certificates allow others (group or principal) to perform CA tasks;

• SDSI’s multiply-signed requests allow the signing principals to combine their
memberships, so the co-signed requests (ie. special certificates) are more powerful than they
are individually;

• Simple functional structures (mainly using data structures), with no organisational structure 
required. 


Disadvantages: 


• No formal policies bind principals and groups, which may make scalable trust
difficult to maintain;

• It may be difficult to implement in large organisations where formal structures and policies 
are required. 


5.2 CRLs in SDSI 


SDSI does not have certificate revocation lists (CRLs). Instead, the "reconfirmation 
required" on signatures is used. Reconfirmation of the specified signature must be done by 
either the original signer (eg. principal), or by the reconfirmation principal. The requirement 
of periodic reconfirmation may also be stated on the certificate. The signer (eg. principal) can 
specify the reconfirmation period (eg. hourly or yearly) for a signature. Signatures contain a 
set of relevant certificates and have expiration dates. 
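
The reconfirmation rule can be sketched as a simple time check. The function and parameter names below are hypothetical, and the sketch assumes the verifier records when a signature was last reconfirmed.

```python
from datetime import datetime, timedelta


def signature_status(last_confirmed, period, expires, now):
    """Classify a signature as 'expired', 'reconfirm' or 'valid'.

    last_confirmed -- when the signer (or reconfirmation principal) last confirmed it
    period         -- reconfirmation period chosen by the signer, eg. hourly or yearly
    expires        -- the signature's expiration date
    """
    if now >= expires:
        return "expired"
    if now - last_confirmed >= period:
        return "reconfirm"   # must be reconfirmed before it is relied on again
    return "valid"
```

A short period approximates the effect of a frequently issued CRL, while a long period reduces the load on the signer.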


6. SPKI Trust-Model 


In 1996, the Simple Public-Key Infrastructure Work Group set up clear objectives for a more
flexible certificate structure and operating procedures for trust management, objectives
similar to those of the SDSI group. The SPKI Work Group stated: "the main purpose of
an SPKI certificate is to authorise some action, give permission, grant a capability, etc". In
other words, it tries to define credential-based certificates rather than identity-based
certificates, like certificates in the X.509 PKI trust model. SPKI work now combines with
SDSI data structures, aiming to support a range of trust models.


Although the SPKI is in the state of "work in progress", its objectives have been clearly stated
in the form of an Internet Draft [1]. Certificates used in SPKI should bind a meaningful
attribute to a public key, eg. an SPKI certificate must be able to bind a key to a local name (ie. 
a person’s nickname). The attribute does not have to be a recognisable name. In other words, 
all certificates should be anonymous wherever possible. An SPKI certificate should be 
designed to discourage keyholders from sharing their private keys, and a certificate holder 
should be able to delegate permissions s/he acquires through the certificate (SDSI data 
structure). An SPKI certificate should be simple and useable in very constrained 
environments (eg. smartcards). An SPKI certificate should have a validity period, and support 
both positive and negative on-line validations via CRL. 
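
Several of these objectives (binding an attribute to a key rather than to a name, delegation, and a validity period) can be illustrated together. The sketch below is not the SPKI certificate format: the AuthCert fields and the chain_grants function are hypothetical, and signature verification is omitted.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AuthCert:
    issuer_key: str        # public key of the issuer
    subject_key: str       # public key being authorised (no name is required)
    permission: str        # the attribute bound to the key
    may_delegate: bool     # whether the subject may pass the permission on
    not_before: datetime   # start of the validity period
    not_after: datetime    # end of the validity period


def chain_grants(chain, root_key, holder_key, permission, now):
    """Check that a chain of certificates carries permission from root_key to holder_key."""
    expected_issuer = root_key
    for i, cert in enumerate(chain):
        if cert.issuer_key != expected_issuer or cert.permission != permission:
            return False
        if not (cert.not_before <= now <= cert.not_after):
            return False
        is_last = i == len(chain) - 1
        if not is_last and not cert.may_delegate:
            return False      # an intermediate holder was not allowed to delegate
        expected_issuer = cert.subject_key
    return bool(chain) and chain[-1].subject_key == holder_key
```

Note that the check never consults a name: the verifier only needs the root key it already trusts and the key presented by the requester.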


6.1 SPKI’s Transitivity of Trust 



SPKI is intended to be a web-of-trust model. It holds that people must be allowed to choose
whom they trust and consider reliable. It may also support a hierarchical-trust model, as stated in
its objectives. Combined with the SDSI model, the transitivity of trust may also use the "delegate
certificates" and "groups" concepts.


6.2 CRLs in SPKI 


In SPKI, a periodic revalidation will be required for certificates and public keys. The 
information of CRLs and Key Revocation Lists (KRL) can be generated periodically and 
broadcast or made available on request and given a limited Time-to-Live (TTL). Each
SPKI certificate has activity dates and conditions which indicate that the certificate must be
tested on-line or through other signed instruments. Any revalidation instrument, such as a CRL
or KRL, must also have its own lifetime.


7. Summary of Trust models and their Advantages and Disadvantages 


The ability to establish a trust relationship on the Internet will determine the rate of growth of 
Electronic Commerce. The digital signature certificate can be seen as a critical factor in trust 
management+ , which can provide scalable trust on the Internet. Trust management 
provides the infrastructure to achieve certain levels of confidence of trust* in a relationship(s). 
Scalable trust allows a constant level of trust to travel to a large number of nodes (or entities) 
and still be able to maintain its level of trust during its transitivity in a specific time frame. 
Different trust models focus on different aspects of trust relationship, such as parties, scales 
of trust relationships and trust infrastructures. They all have their place in the market of 
electronic commerce. The following table shows the various strengths and limitations of 
different trust models. 


Advantages / Strengths

    X.509:
        - industrial standard;
        - aligns with large organisation structures;
        - CPS provides a formal bond between CA and subordinate CAs;
        - many products support it.

    PGP:
        - easy to obtain;
        - self-control;
        - changing the level of trust at one’s fingertips.

    SDSI & SPKI:
        - flexible in trust management using delegate certs and principals;
        - readiness of implementation;
        - free-form information on individuals.

Disadvantages / Limitations

    X.509:
        - complexity in organisational and functional implementation;
        - complicated in its policies (eg. CPS);
        - difficult to propagate CRL in a large domain.

    PGP:
        - technical knowledge is required (eg. to revoke a key);
        - allows "traffic analysis";
        - no CRL.

    SDSI & SPKI:
        - no formal bond of trust between principals;
        - a member can make an enquiry on behalf of its group.


(Table 2) 


As the development of trust management in different trust models is continuing to progress, it 
is very difficult to predict which trust models will merge and what new trust models will be 
proposed. However, one thing that is certain is that there is a need for PKI trust models and 
certification services to protect the authenticity of legitimate businesses and intellectual 
property. 


8. Conclusion 


Trust is a critical component in making business decisions on the Internet. A simple trust 
concept involves cryptographers, economists, policy makers and markets participating in the 
development of a trust relationship on the Internet. Different trust models take different 
approaches to establishing a trust relationship among business parties. The different models
provide different structures of trust relationship, ie. hierarchical or web-of-trust relationships. To
choose the right tool for the right job, you need to know your own or your organisation’s needs, and
the different capabilities and strengths of the various trust models.


A CA also has different classes of certificates to provide different levels of trust. There is no
100 per cent guarantee in a trust relationship, so there is a need to examine a CA’s
Certification Practice Statement carefully, which requires a certain knowledge of the CA’s
policies and the X.509 Public Key Infrastructure. This will require resources to educate your
own users and staff and should be seen as a future investment for a company.


9. References : 


[1] Carl M. Ellison, "SPKI Requirements", Internet-Draft, 11 March 1998,
ftp://ftp.ietf.org/internet-draft/draft-ietf-spki-cert-req-01.txt (Current version 18 July 1998).

[2] Certification Authorities, http://www.qmw.ac.uk/~tl6345/ca.htm (Current version May 12 
1998). 

[3] Government Public Key Authority, at http://www.gpka.gov.au (Current version July 25 
1998). 

[4] Ian Taylor MBE MP, Minister for Science & Technology. "Licensing of Trusted Third
Parties for the Provision of Encryption Services", http://jya.com/ukttp.htm (Current version 
18 July 1998). 

[5] ITU-T Recommendation X.509, "Information Technology - Open Systems 
Interconnection - the Directory : Authentication Framework". International 
Telecommunication Union, 1993. 

[6] ITU-T Recommendation X.509, "Information Technology - Open Systems 
Interconnection - the Directory : Authentication Framework". International 
Telecommunication Union, 1996. 

[7] Joseph M. Reagle Jr., "Trust in electronic Markets: the convergence of Cryptographers
and Economists", http://rpcp.mit.edu/~reagle/commerce/commerce.htm (Current version
1996).

[8] Marc Branchaud, "A Survey of Public-Key Infrastructures", 
http://www.xcert.com/~marcnarc/PKI/thesis/ (Current version July 1998). 

[9] PKI Technical Working Group (PKI-TWG), http://csrc.ncsl.gov/pki/twg/twgindex.html 
(current version February 14 1998). 

[10] Public Key Infrastructure Study Final Report. Produced by the MITRE Corporation for
NIST. April 1994, http://www.xcert.com/~marcnarc/PKI/#Xstandards (Current version August
1998).

[11] PKIX Working Group, "Internet X.509 Public Key Infrastructure : Certificate Policy and 
Certification Practices Framework", April 25, 1998. 

[12] Ronald L. Rivest and Butler Lampson, "SDSI - A Simple Distributed Security 
Infrastructure", http://theory.lcs.mit.edu/~rivest/sdsi10.html.


[13] Simson Garfinkel, "PGP : Pretty Good Privacy", O’Reilly & Associates, Inc. First 
Edition, January 1995. 

[14] Stephen Frede, "Certificates - Who’s There ?", Systems, p68, February 1998. 

[15] "Strategies for the Implementation of a Public Key Authentication Framework (PKAF) 
in Australia", http://www.acs.org.au/president/1996/epubs/pkaf.htm (Current version 26 
March 1998). 

[16] VeriSign, "Digital IDs Introduction", http://digitalid.verisign.com/id_intro.htm#ID_use
(current version March 12 1998). 

[17] Warwick Ford and Michael S. Baum, "Secure Electronic Commerce," Prentice-Hall, 
1997. 

[18] W. E. Burr (Editor), "Federal Public Key Infrastructure (PKI) Technical Specifications 
(Version 1) Part C: Concept of Operations", Part of the US NIST PKI program, 
http://csrc.ncsl.nist.gov/pki/pkiconl3.ps (Current version 18 July 1998). 

[19] Yinan Yang," Security Mechanisms in Electronic Commerce", 
http://www.cs.adfa.edu.au/~yany97/report97.html (Current version 2 December 1997). 


Acknowledgements 


We would like to thank Ed Lewis, Bob McKay and Warwick Ford for their generous 
contribution to the work. This paper would not have been possible without their help and 
encouragement. 



Torn Money and the PGP Web of Trust 


Jeanette McLeod, Greg Rose 

QUALCOMM Australia
Suite 410, Birkenhead Point 
Drummoyne,2047 
NSW 

(02) 9181 4851 

{jmcleod, ggr}@qualcomm.com


Abstract 


Legitimacy and trust are perhaps the most complicated aspects of PGP. The trust 
model used by PGP assumes that trust starts with bilateral arrangements (key 
signing) and grows organically to produce a decentralised “web” simply known as 
the “Web of Trust”. Decentralisation is advantageous in that it foregoes the need for 
any central authority, yet the model as it stands does not scale well in a large open 
community. Torn Money has been designed as an authentication service primarily to
facilitate the introduction of new users to the web of trust and also as a means of
enhancing connectivity within the existing web.

Torn Money is a follow-up to the AUUG’s PGP Key Signing Service, which in
essence, seeks to maintain and support PGP’s decentralised trust model.


Background 

Pretty Good Privacy (or PGP) is a publicly, and internationally, available privacy program. 
Essentially, it uses public key cryptographic techniques to allow messages to be exchanged 
between people across public networks, while protecting the privacy of the contents and 
guaranteeing authenticity of the sender. 

Traditionally, one of the problems with cryptographic systems was "key management". The key is 
the secret value that allows information to be encoded and/or decoded. Prior to the development of 
public key cryptography, the key had to be securely exchanged between parties before they could 
communicate. Public key systems are designed such that two separate keys are used, one of which 
can be made public (like a telephone number) while the other must be kept secure by the owner 
(like the telephone itself). In light of this development, it would appear that the problem of key 
management has been solved. 


Unfortunately this is not the case. Key management is undeniably easier using public key systems, 
but the question now becomes one of authentication. How do you know, for sure, that the person 
you are sending a secret message to is really the person they claim to be? I could easily get a 
telephone connected in another name, and sit back waiting for phone calls intended for another 
person of that name. 

One solution to the problem is to introduce the notion of "trusted parties", that is, people who you 
trust to introduce (and therefore authenticate) other parties to you. Using the telephone analogy, 
you would only say secret things on the phone if someone you trust had given you the telephone 
number, not if you had just looked it up in the phone book. This is what the PGP documentation 
refers to as the "Web of Trust". Its structure is likened to that of a web as each party involved, 
trusted by you, can introduce other parties whom you may or may not already know. 

Another possible solution is the use of Certification Authorities, thereby enforcing a hierarchical 
structure on the Web of Trust. What this means is that any public key you acquire must now come 
with a list of certificates. For example, J. Smith’s public key might come with a certificate from 
Widgets, Inc., stating that he works for them. In order to establish their authenticity, Widgets, Inc. 
would also require a certificate from someone asserting that they are a Delaware Corporation. To 
authenticate this, the state of Delaware would need a certificate to verify it was really what it 
claimed to be, and so on. Eventually the regression must stop, with a certificate being issued by 
some omnipresent authority (which, at the moment, is RSA Data Security Inc.). 

Both schemes have flaws. The major problem with the Web of Trust is that it has to be big and well 
connected before it becomes useful, while the Certification Authority approach assumes the sort of 
control which is often the reason the parties wanted to communicate privately in the first place. 

(The above is intended to be an absolutely minimal explanation of the concepts of public key 
cryptography and key management. If the concepts are not yet clear, the PGP documentation, 
which you should eventually read, explains it in more detail.) 


Torn Money and the PGP Key Signing Service. 

In an attempt to expand the Web of Trust, AUUG set up a PGP Key Signing Service in which it 
acted as an introducer for PGP keys. By virtue of the conferences it held, AUUG was in a position 
to physically meet with people, verify their identity and then issue key signatures attesting their 
identity. The high public profile of the organisation meant that key verification wasn’t difficult, and 
as the procedures for the key signing were made public, it was easy to decide what level of trust to 
place in the authenticity of a key signed by AUUG. However, the service was beginning to 
introduce a hierarchy into the Web of Trust with AUUG inadvertently taking on the role of a 
Certification Authority. The implications of this brought the service to an end, as it was no longer
conforming to the PGP trust model. However, the service had one very innovative feature: it did not
require people to have their key ready in advance.


"Torn Money" has been designed in the same vein as the Key Signing Service, with its main aim to
facilitate PGP key signing. While this new service avoids the problems that the Key Signing
Service was beginning to encounter, it manages to preserve the favourable features of the old
service - namely, it still allows the verification of those who have not prepared their key in
advance. The inspiration behind Torn Money comes from old spy films, where the possession of a
significant half of a torn banknote established a person's identity. The beauty of such a concept is
that it no longer requires an "authority" such as AUUG to oversee the key signing; the notion of the
"torn" banknote means that any two parties can be involved and still effectively identify each other
at a later date.


Introduction to Torn Money 

PGP signing can occur whenever one interested party meets with another (conferences such as 
those hosted by AUUG or USENIX are a common forum for such an activity). People wishing to 
have their keys signed provide acceptable proof of identity together with their PGP fingerprint to 
the person or persons they wish to have sign their key. Their public key can then later be retrieved 
for signing from a key server or sent via email, with the supplied fingerprint providing verification 
of the key’s authenticity. However, this kind of key signing is only meaningful if the interested 
parties already have PGP keys generated and their fingerprints with them. This is not always the 
case. 

Torn Money sidesteps this issue by providing a way in which interested parties can successfully
identify each other at a later date. Conceptually, this means that upon meeting, interested parties 
will establish their identities as before and then obtain a "secret". The possession of this secret is 
what enables secure future communication. With this in place, those who are unprepared now have 
the opportunity to create a PGP key at some later time and then communicate the required details to 
those parties from which they obtained their secret. By revealing the secret they were given, they 
are able to prove their identity, thus validating their key for signing. 

While conceptually this scheme makes it viable for two unprepared parties to trade details, Torn
Money's primary function is to introduce newcomers to the web of trust and enhance connectivity.
It is therefore essential that one of the parties involved already belong to the web of trust so that
their signature will act to initiate a newcomer. This person, call them an "expert user", will
effectively become the "owner" of the torn money. It is their responsibility to generate and
distribute the torn money, but they are in no way to be considered an "authority". To such effect,
the newcomer is well advised to participate in the torn money scheme with as many expert users as
they can.


What Exactly is Torn Money?

Torn Money borrows its form from that of a banknote. It is simply a piece of paper containing pairs
of related secrets (which function something like a banknote's serial number). Upon generating a
piece of torn money, the expert user will be required to enter their name, email address, PGP Key
ID and fingerprint, and the number of newcomers they wish to sign keys for. This information is
required to facilitate future communication between the owner of the torn money and the recipients.

The generated piece of torn money will contain the owner’s name and PGP fingerprint at the top, as
well as a sentence comprising eight four-letter words - their secret. Next there is a blank table of n
rows, where n is the number of newcomer keys the owner elected to sign. This is left blank so that
the expert user can note down the name, email address and identification information (optional) of
anyone wishing to participate. Lastly, the remainder of the document is divided into n sections,
designed to be "torn" off by the owner and distributed among the participants. Each section
contains the name of the expert user, their email address, PGP key ID and fingerprint, the web
address of the Torn Money web site, and eight four-letter words - the participant's secret. (See
Appendix 1 for a sample of Torn Money).

After verifying a newcomer's identity, the expert user notes down their details in a row in the table
and gives them the tear-off section corresponding to their row number. This piece of Torn Money
should be kept safe as it is now the only existing link between the expert user's identification
information and the new user. For security reasons it is also vital that no one else has access to the
Torn Money as it contains the new user's secret.

Once the newcomer has generated their own PGP key, they should send email to the expert user(s)
for which they have Torn Money. To be secure, this email should be encrypted using the expert
user's PGP key (obtained from either the expert user or a key server and verified with the
fingerprint on the newcomer's half of the torn money), and signed using the newcomer's PGP key in
order to prove ownership. The content of this mail should comprise the new user's PGP public key
itself and the secret eight four-letter words from the Torn Money.

Upon receiving this message, the expert user must verify the secret they have been sent before
signing the new key. The new user's secret is derived from a combination of the expert user's secret,
their row number in the table on the expert's half of the Torn Money, and the expert user's name.
Thus the expert user must provide these details exactly to the Torn Money verification program in
order to authenticate the contents of the email message. Once this has been achieved, the expert
user can sign the new user's key and return it.
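
The derivation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the paper does not say which function combines the owner's secret, row number and name, so the use of HMAC-SHA256 here, the tiny word list, and the names participant_secret and verify are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical word list; the real program's dictionary is not given in the paper.
WORDS = ["real", "yawn", "warm", "peel", "date", "rate", "quit", "list",
         "mesh", "dare", "tick", "pure", "boat", "pike", "fast", "told"]


def words_from_digest(digest: bytes, count: int = 8) -> str:
    """Map the first `count` bytes of a digest onto four-letter words."""
    return " ".join(WORDS[b % len(WORDS)] for b in digest[:count])


def participant_secret(owner_secret: str, row: int, owner_name: str) -> str:
    """Derive a newcomer's secret from the owner's secret, row number and name.

    Assumption: a keyed hash (HMAC-SHA256) stands in for whatever
    combination the actual Torn Money program uses.
    """
    message = f"{row}:{owner_name}".encode("utf-8")
    digest = hmac.new(owner_secret.encode("utf-8"), message, hashlib.sha256).digest()
    return words_from_digest(digest)


def verify(owner_secret: str, row: int, owner_name: str, claimed: str) -> bool:
    """Recompute the secret from the owner's half and compare with the emailed one."""
    expected = participant_secret(owner_secret, row, owner_name)
    return hmac.compare_digest(expected.encode("utf-8"),
                               claimed.strip().lower().encode("utf-8"))


if __name__ == "__main__":
    owner = "real yawn ntis warm winy peel date rate"
    secret = participant_secret(owner, 1, "Jeanette McLeod")
    print(secret, verify(owner, 1, "Jeanette McLeod", secret))
```

The point of the design is that the expert user needs to keep only their own half of the paper: every participant's secret can be recomputed from it on demand.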

Note: The use of Torn Money is in no way restricted to newcomer/expert user pairs. As our overall
objective is to increase connectivity within the web of trust, established users of PGP who may
arrive at a gathering ill equipped for key signing are also encouraged to use Torn Money.


Torn Money Generation and Verification

Torn Money can be generated in two ways: either by using the web interface at the USENIX web
site, or by downloading the source for it and generating it on your own computer. The same option
applies to the verification part of the procedure: a web interface is available, and the source for it
comes as part of the download for the Torn Money program.





User Support 


Once the Torn Money project is complete, full documentation and procedures for use will be made
available from the USENIX web site. At this point in time we envision the users of Torn Money to
comprise three distinct groups: new users of PGP seeking connection to the Web of Trust, expert
users willing to certify new users, and people wishing to advertise gatherings (e.g. conferences,
seminars, etc.) where PGP key signing or exchange of Torn Money can occur. As such, a series of
pages will be dedicated to each group.


Newcomers Instruction Page: 

In support of new users of PGP and Torn Money, a series of help pages will be made available and
the web address included on their piece of Torn Money. These pages will include information on
PGP and trust, the function of Torn Money and its usage, links to key servers, and the details of any
gatherings at which the exchange of Torn Money can occur.


Expert Users Instruction and Generation Page: 

A set of pages will also be aimed at established users of PGP who wish to generate their own pieces
of Torn Money. These pages will include information on the function of Torn Money and its usage
as applicable to an expert user, as well as details on how to generate Torn Money and how to verify
responses from recipients. The date and location of any gatherings at which the exchange of Torn
Money can occur will be made available, and expert users intending to engage in key signing (and
specifically, the distribution of Torn Money) will be given the option to register their attendance at
specific functions.


Organisers Page: 

As part of the Torn Money key signing service, support will be given to functions and gatherings at
which key signing can occur. This support will be provided through a series of pages on the
USENIX web site which will allow organisers to register their functions as forums for PGP key
signing and the distribution of Torn Money. The time, date and location of the function will be
made publicly available so that expert users may indicate their attendance and hence their
willingness to certify new users, and newcomers seeking an introduction to the Web of Trust may
see when they next have the opportunity to be certified.

All feedback, questions and concerns regarding Torn Money can be directed to Greg Rose and/or
Jeanette McLeod. Over time appropriate FAQs will be compiled and posted to the web site and
Torn Money will be revised to better meet user needs.





Concluding Remarks 


Successful worldwide use of PGP depends on a widespread, well connected Web of Trust. Torn
Money has been designed with this goal in mind. The project is due for completion sometime in
October, and the web pages discussed in this document will be made available from the USENIX
web site http://www.usenix.org. In the meantime any feedback on the project is most welcome,
and Torn Money is available for trial usage on request.





Appendix 1 








TORN MONEY FOR Jeanette McLeod 


Key Fingerprint: 1B DE 98 8F C8 49 05 4B 82 56 DD DA 67 4E FD B0


Verification: real yawn ntis warm winy peel date rate 


No. | Name | Email Address | Id Information (optional) 


[Blank table rows, one per participant, for the owner to fill in]

0. Name: Jeanette McLeod 

Email: jmcleod@qualcomm.com 

Key ID: 2C500945 

Public Key Fingerprint: 

1B DE 98 8F C8 49 05 4B 82 56 DD DA 67 4E FD B0
Verification: quit list burg mesh dare jane afro grad 

Help: http://www.USENIX.org/tornmoney/newcomer.html 





























1. Name: Jeanette McLeod

Email: jmcleod@qualcomm.com

Key ID: 2C500945

Public Key Fingerprint:

1B DE 98 8F C8 49 05 4B 82 56 DD DA 67 4E FD B0
Verification: marc oral tick voss mimi cosh toby pure

Help: http://www.USENIX.org/tornmoney/newcomer.html




2. Name: Jeanette McLeod

Email: jmcleod@qualcomm.com

Key ID: 2C500945

Public Key Fingerprint:

1B DE 98 8F C8 49 05 4B 82 56 DD DA 67 4E FD B0
Verification: boat pike amok fast abbe told held

Help: http://www.USENIX.org/tornmoney/newcomer.html






























3. Name: Jeanette McLeod

Email: jmcleod@qualcomm.com

Key ID: 2C500945

Public Key Fingerprint:

1B DE 98 8F C8 49 05 4B 82 56 DD DA 67 4E FD B0
Verification: scot cord iris lure doff fuel lazy quad

Help: http://www.USENIX.org/tornmoney/newcomer.html


4. Name: Jeanette McLeod

Email: jmcleod@qualcomm.com

Key ID: 2C500945

Public Key Fingerprint:

1B DE 98 8F C8 49 05 4B 82 56 DD DA 67 4E FD B0
Verification: anne duke lamp mock blat lark gawk lair

Help: http://www.USENIX.org/tornmoney/newcomer.html


























Unified Authentication Using PAM 


Anthony Baxter 

Development Manager, connect.com.au Pty. Ltd. 

<arb@connect.com.au> 


A common concern for many organisations today is the need to provide a single unified
mechanism for authenticating users to the many systems and applications that require that service.
This talk will discuss Pluggable Authentication Modules, which allow these systems and
applications to be separated from the underlying authentication service. The selection of which
authentication service to use can be made without having to change the application and can be
configured by the administrator of the computer system to meet organisational requirements.


The Problem 

Once upon a time, everyone used the /etc/passwd file on a Unix machine to store user names 
and user passwords, and any applications that needed to authenticate a user would use this 
database. Life was simple, life was good. 

In the years since this initial scheme was devised, the requirements for an authentication
service have expanded considerably:

• There is now a wide variety of schemes for storing user information, including network
databases (such as Sun's NIS / NIS+, MIT Kerberos, LDAP or RADIUS), which means that
the user information may be stored on a different machine entirely.

• The original username/password authentication is still in use, but at many sites has been 
enhanced or replaced by some form of one-time authentication, token/smartcard based 
authentication, or public key cryptography. 

• It's extremely useful to be able to control the authentication process in a more flexible
manner, without having to rebuild all the applications on the computer system each time you
wish to change a setting, or edit a different configuration file for each application. An
example of this might be to prevent ordinary users logging into a system at all during some
scheduled maintenance window, while still allowing administrators to use the system.

Adding a new authentication service into an existing computer system is a tricky and
error-prone task - making it work correctly with each of the applications on the system, and each
of the existing authentication services, can be a considerable amount of work.

A (partial) list of some of the applications that typically require authentication services and
some of the typical authentication mechanisms is shown in Table 1. As can be seen from this
table, the process of offering all authentication mechanisms to all systems would be an
enormous task.


TABLE 1. Services and Mechanisms

Services              Authentication Mechanisms

POP/IMAP (email)      username / password
ftp                   .rhosts files
web servers           RADIUS
file sharing          Kerberos
login                 hardware tokens
databases             public key (e.g. ssh, SSL)
                      S/Key (one-time passwords)


In addition, it is useful to be able to separate the process of authentication (“are you who you
claim to be?”) from authorisation (“now that I know who you are, are you allowed to use this
service at this time?”). This simplifies the process of offering limited services to particular
classes of users, and in general offers the administrators of the system a great deal more
flexibility. (An example of this might be a web application where customers are allowed to
authenticate themselves to view web pages, while staff are allowed to log in to the machine and
update the web content.)


A Solution 

As a solution to these problems, engineers at Sun Microsystems developed Pluggable
Authentication Modules (PAM). PAM sits between the applications and the authentication
mechanisms. Each application is modified to call the PAM library functions to obtain authentication,
authorisation, and session management services. Each authentication mechanism has
wrappers around it to allow it to be called by the PAM library. To add a new authentication
mechanism, it’s simply a matter of building a single module for it, and installing it in the appropriate
location. You can then switch applications over to using the new mechanism by modifying the
PAM configuration on the system.

So instead of n applications needing to work with m authentication libraries, we have each
application needing to be modified once, and each authentication routine needing to be
modified once. We then end up with something similar to what is shown in Table 2, with each of the
applications being modified to call the standard PAM library, and each authentication
mechanism being modified so that it can be called by the PAM library.





TABLE 2. Services and Mechanisms, with PAM

Services                         Authentication Mechanisms
(call PAM library)               (called by PAM library)

POP/IMAP (email)                 username / password
ftp                              .rhosts files
web servers                      RADIUS
file sharing           PAM       Kerberos
login                            hardware tokens
databases                        public key (e.g. ssh, SSL)
                                 S/Key (one-time passwords)


The Details 

There are four tasks handled by the PAM library: 

• Authentication Management 

• Account Management 

• Session Management 

• Password Management 


Authentication Management 


An application needing to authenticate a user calls into the PAM library to do this. The library
looks up the application in the PAM configuration, and calls into the appropriate
authentication mechanism's module. If this module needs to prompt for any additional information (for
example, a passphrase, or a password) it calls back to a “conversation function” in the
application. An example will make this clearer. In this example, we have an application Login, and it
should force all users to use the Frobozz SmartCard system. The user ABaxter is attempting to
connect to this service.


1. Login Application initialises the PAM library, including setting up a conversation function (a way to
prompt the user for input).

2. Login Application calls the PAM library, asking it to authenticate user ABaxter.

3. PAM library finds Login in its configuration, and determines that it should use the Frobozz SmartCard
to perform authentication.

4. PAM library loads the Frobozz SmartCard module, and calls the authentication function.

5. The Frobozz module prepares a challenge, and passes it back to the library, so the library can pass it
back to the application.

6. The library sends the challenge back to the application’s conversation function, as a “User Prompt”.

7. Login prompts the user with the challenge; in this case maybe it prints it on the user’s text screen.

8. The user types in a response, and this is sent back through the library to the authentication module.

9. The authentication module accepts or rejects the authentication, and passes this back to the library,
which sends the response back to the application.


The authentication module can call the conversation mechanism as many times as it needs to
get information. Note also that it’s up to the application to display the prompt to the user - the
authentication module does not need to know how to display prompts in an X11 window or to
a web page, or whatever is needed.
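
The exchange in the numbered steps can be modelled with a small toy program. This is not the real PAM C API; it is a sketch in which the class names, the fixed challenge "4711" and the reversed-digits response rule are all invented for illustration. What it shows is the separation of duties: the module decides what to ask, while the application's conversation function decides how to ask it.

```python
from typing import Callable, Dict

# A conversation function takes a prompt and returns the user's reply.
Conversation = Callable[[str], str]


class FrobozzSmartCardModule:
    """Stand-in for an authentication module (hypothetical challenge scheme)."""

    def authenticate(self, user: str, converse: Conversation) -> bool:
        challenge = "4711"
        # The module never talks to the user directly; it asks via the callback.
        response = converse(f"Challenge for {user}: {challenge}")
        # Invented rule: the correct response is the challenge reversed.
        return response == challenge[::-1]


class ToyPamLibrary:
    """Looks up the module configured for a service and runs it."""

    def __init__(self, config: Dict[str, FrobozzSmartCardModule]):
        self.config = config

    def authenticate(self, service: str, user: str, converse: Conversation) -> bool:
        module = self.config[service]
        return module.authenticate(user, converse)


def scripted_conversation(prompt: str) -> str:
    """A real login program would display the prompt; here the reply is scripted."""
    digits = prompt.rsplit(" ", 1)[-1]
    return digits[::-1]


pam = ToyPamLibrary({"login": FrobozzSmartCardModule()})

if __name__ == "__main__":
    print(pam.authenticate("login", "ABaxter", scripted_conversation))
```

Swapping scripted_conversation for a function that draws a dialog box or renders a web form would require no change to the module, which is the property the text describes.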





Account Management 

The account management part of PAM is generally used to choose whether to allow a service 
for a particular user, or at a particular time, or some other constraint. For example, the user 
“root” might only be allowed to log in on the console of a machine, or game playing users 
might only be allowed to connect when the system is below a particular load level. 

Session Management 

The session management section of PAM deals with logging and accounting for the session. 
This allows the administrator to choose how to log the user connection attempt. 

Password Management 

Finally, the password management section of PAM allows a user to work through an
application to change their authentication token - usually a password, or some other piece of
information.

The Configuration File 

Most of the flexibility of PAM comes from its configuration file. It is here that an
administrator sets the policy for which application uses which module or modules.

The configuration file consists of a series of lines, in the following form: 

service-name module-type control-flag module-path module-arguments 

The service-name is the name of the application, for example “ftp”, “login”, or “http”. The 
special service-name “OTHER” is used as a default, if no application-specific entry can be 
found in the configuration. 

The module-type is one of “auth”, “account”, “session” or “password”. You can specify
multiple lines with the same service-name and module-type - this is known as stacking, and is an
incredibly powerful tool. Stacking will be covered in more detail shortly.

The module-path specifies the location of the module for this application (typically these are 
stored in the /usr/lib/security directory, but there’s no reason they can’t go somewhere else). 

Module-arguments is just an (optional) list of additional arguments that are passed through to 
the module. 

Control-flags specify how the module should react to success or failure of the current
operation. This is a key part of PAM’s flexibility. The four settings for this flag are:

required: this indicates that the success of this module is required for the operation to succeed.
The application will not be notified of the failure of the operation until all other modules have
been tried.

requisite: like required; however, in the case that such a module returns a failure, control is
directly returned to the application.





sufficient: success from this module is sufficient to satisfy the PAM library that this operation
is successful. Assuming that no previous required module has failed, no more “stacked”
modules of this type are invoked, and a success response code is returned to the application. A
failure of this module does not cause the PAM library to abort - instead it will move on to the next
module of this type.


optional: as its name suggests, this module is not critical to the success or failure of the user's
application for service. However, if no previous modules have resulted in a success or failure,
then this module will decide on the response to the application.
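
The interaction of the four control flags is easier to follow as code. The function below is a simplified model of the rules as stated in this paper, not the actual PAM library logic; run_stack and the representation of modules as plain callables are assumptions made for the sketch.

```python
from typing import Callable, List, Tuple

# Each stack entry is (control_flag, module); module() returns True on success.
Stack = List[Tuple[str, Callable[[], bool]]]


def run_stack(stack: Stack) -> bool:
    """Evaluate one stack of modules of the same module-type."""
    required_failed = False   # some "required" module has failed so far
    any_decisive = False      # a required/requisite module has been run
    optional_result = None    # outcome of the first optional module, if any

    for flag, module in stack:
        ok = module()
        if flag == "required":
            any_decisive = True
            if not ok:
                required_failed = True   # remember, but keep trying the rest
        elif flag == "requisite":
            any_decisive = True
            if not ok:
                return False             # fail immediately
        elif flag == "sufficient":
            if ok and not required_failed:
                return True              # succeed immediately
            # a failed "sufficient" module is simply skipped
        elif flag == "optional":
            if optional_result is None:
                optional_result = ok
        else:
            raise ValueError(f"unknown control flag: {flag}")

    if required_failed:
        return False
    if any_decisive:
        return True
    # Nothing else decided the outcome, so the optional module does.
    return bool(optional_result)


if __name__ == "__main__":
    ok, fail = (lambda: True), (lambda: False)
    print(run_stack([("sufficient", fail), ("required", ok)]))
```

Note the asymmetry this makes visible: a failed required module still lets later modules run (so an attacker cannot tell which one failed), whereas a failed requisite module stops the stack at once.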


These flags allow the administrator considerable flexibility in how the system is configured. 
Some examples follow. 


# default; everything not explicitly permitted is denied.
#
OTHER   auth      required    /usr/lib/security/pam_deny.so
OTHER   account   required    /usr/lib/security/pam_deny.so
OTHER   password  required    /usr/lib/security/pam_deny.so
OTHER   session   required    /usr/lib/security/pam_deny.so


This is a good set of defaults - everything that is not explicitly permitted previously in the file
will be denied. This is a good security policy, and prevents problems from a misconfiguration.
The special module pam_deny.so simply denies all requests (there is a corresponding
pam_permit.so which permits all requests). It is worth noting, however, that
pam_deny.so does no logging of failures.


#
# imap; first check for passwords in the local passwd file,
# then look in RADIUS. Log everything via RADIUS.
#
imap    auth      sufficient  /usr/lib/security/pam_unix.so
imap    auth      required    /usr/lib/security/pam_radius.so
imap    account   required    /usr/lib/security/pam_permit.so
imap    session   required    /usr/lib/security/pam_radius.so
imap    password  required    /usr/lib/security/pam_deny.so


This specifies that for IMAP authentication, first look in the Unix password file. If that
succeeds, return success to the application. If it fails, then try using the RADIUS module, and
return the results (success or failure) based on what the RADIUS module returns. For the
Account Management part of the application, we simply let everything through. An alternative
might be to have a small module that simply checks for the existence of /etc/nologin or
similar, and denies logins if that file is present. To log the session start/session end, we use the
RADIUS module again, to send accounting details to our standard RADIUS server. Finally,
any password changing attempts should just be denied.
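
To make the file format concrete, here is a small parser for lines of the form given earlier (service-name, module-type, control-flag, module-path, module-arguments), including the fallback to the OTHER service. It is an illustrative sketch only; parse_pam_conf, lookup and the Entry type are names invented here and are not part of any PAM distribution.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Entry(NamedTuple):
    service: str
    module_type: str
    control_flag: str
    module_path: str
    arguments: List[str]


Stacks = Dict[Tuple[str, str], List[Entry]]


def parse_pam_conf(text: str) -> Stacks:
    """Group configuration lines into stacks keyed by (service, module-type)."""
    stacks = defaultdict(list)
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if len(fields) < 4:
            raise ValueError(f"malformed line: {line!r}")
        entry = Entry(fields[0], fields[1], fields[2], fields[3], fields[4:])
        stacks[(entry.service, entry.module_type)].append(entry)
    return dict(stacks)


def lookup(stacks: Stacks, service: str, module_type: str) -> List[Entry]:
    """Return the stack for a service, falling back to the OTHER default."""
    return stacks.get((service, module_type)) or stacks.get(("OTHER", module_type), [])


SAMPLE = """
# imap: local passwd file first, then RADIUS
imap   auth     sufficient  /usr/lib/security/pam_unix.so
imap   auth     required    /usr/lib/security/pam_radius.so
OTHER  auth     required    /usr/lib/security/pam_deny.so
"""

if __name__ == "__main__":
    for entry in lookup(parse_pam_conf(SAMPLE), "imap", "auth"):
        print(entry.control_flag, entry.module_path)
```

Order within a stack is preserved, which matters because the control flags are evaluated top to bottom.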


Module Availability 

Under Solaris’ standard installation, there are only a few modules supplied - enough to
provide the standard login functionality (.rhosts files, username/passwd), as well as a dial_auth
module for authenticating dialin users separately from normal users.

Redhat Linux, and any other systems which ship something based on Linux-PAM, come with 
a much wider range of modules. 





A partial list of available modules: 

Unix Password Files; RADIUS; LDAP; Kerberos; S/Key (aka OPIE); Netware (including
NDS); SMB (Windows NT); cracklib (for checking passwords before allowing them to be
changed); several different smartcards; and a whole host of different small modules for
checking various small things before allowing access.


PAM Availability 

PAM is supported partially in Solaris 2.5.1 and fully in 2.6. HP/UX will be supporting PAM,
and it is part of the X/Open Single Sign-On specification, which should mean it appears on
other platforms over time. In addition, there is an Open Source® version of PAM (called
Linux-PAM, although it is portable to other Unices) which is supplied on some of the larger
Linux distributions. Linux-PAM has also extended the original specification (in a
backwards-compatible way) and the maintainers are attempting to push the changes back into the original
specification. [1]

There is a wide variety of modules available from the Linux-PAM website, and many
vendors of authentication mechanisms are now making PAM modules available. For applications,
all of Solaris’ systems should work through PAM, and it’s reasonable to assume the same of
HP/UX. For those of a more do-it-yourself mindset, Redhat have ported almost all the
applications you could ever want to PAM, and have made the source available [2] from their web and ftp
sites. There are also links to applications at the Linux-PAM website. If the vendor of your
application or authentication method has not yet provided PAM support, feel free to beat them
up with facts - hopefully they will see

PAM is not available for Windows NT, Novell, or other non-Unix systems. 

PAM in the Real World (aka PAM at Connect) 

At Connect.com.au, we use PAM, with an underlying RADIUS module, to authenticate dialup
users using POP, IMAP and FTP. We also use it to authenticate users accessing their account
maintenance web pages, through the Apache mod_auth_pam module. Using PAM has
allowed us to test and deploy new versions of the underlying authentication code onto live
systems, without any customer-visible outages (this would have been much harder without
PAM). Further, whenever a useful successor to RADIUS comes along [3] we should be able to
deploy it without having to rewrite all of our existing applications.


1. See the Appendix for some of the differences between “standard” PAM and Linux-PAM.

2. It’s in the slightly annoying Redhat “rpm” format, but you can find an rpm2cpio converter to help you extract
the contents, on Redhat’s website.

3. For a rant on the problems with RADIUS, contact the author. 





Conclusion 


The PAM system offers a great deal of flexibility and control to the administrators of
computer systems. At the same time, it greatly reduces the work required by vendors of applications
and of authentication systems to support a new authentication method.

The primary problem holding back PAM today is support from vendors. In the operating
system field, Solaris, HP/UX and Redhat Linux are the only major systems which ship with PAM
support. It would be interesting to find whether a system similar to PAM could be adapted to
Windows NT - although this would probably require more knowledge of NT internals than is
currently publicly available.


Links and Further Reading 

http://linux.kernel.org/pub/linux/libs/pam/ 

The Linux-PAM home. From here, you can read the original PAM specification, look at the
lists of available modules and applications, and find links to other relevant websites.

http://www.sun.com/solaris/pam/

Sun’s page on PAM. 
http://www.redhat.com/ 

Redhat’s page on PAM. 

http://www.redhat.com/linux-info/pam/ 


Appendix: Linux-PAM Additions to PAM. 

Linux-PAM added the ability to pass binary (non-text) data around in conversation functions. 
This was needed to support some of the public key software out there (e.g. ssh). 

As well as using a single large /etc/pam.conf file, Linux-PAM allows you to put each
application's configuration in a separate file in the directory /etc/pam.d.

The syntax for the control-flag section of the configuration file has been greatly extended. The
new syntax is very powerful, but still somewhat experimental. Fortunately, all the old flags
still work.


About the author 

Anthony Baxter is the Development Manager in the Internet Engineering group of
connect.com.au. Previously he has been a software developer at connect, and once upon a time he
used to be a system administrator and then a system manager (but not at connect). He has been
working with Unix systems and the Internet since around 1990 (it’s hard to keep track
sometimes).

Open Source® is a trademark of Software in the Public Interest.




Cisco System Copyright 




IPSec Encryption 

Phillip Yialeloglou 
Cisco Systems Australia 


phil@cisco.com 


Agenda 


• Threats 

• Applications 

• Basic technologies 

• Details 

• Industry update 




Loss of Data Integrity 


[Figure labels: "Deposit $1000", "Deposit $100", "Customer"]


Identity Spoofing 


[Figure: "I’m Bob. Send me all corporate correspondence with Cisco."]







Denial of Service 




IPSec Applications 






What is IPSec? 


• Proposed Internet standard for IP-layer encryption with IPv4 and IPv6

• Transparently secures transmitted data

• Allows implementation of Global Networked Business



Enable Mobile Users 



• Encrypts traffic from remote sites to 
confidential servers 

• Travelers can access the network as securely
as they would in the office


Protect the WAN 



Allowing networks to expand through the 
Internet or over leased lines 





• Encrypted traffic between partners over Internet 

• Unencrypted traffic within your network


IPSec Technologies 


Encryption 
Device authentication 
Key exchange 


Encryption Layers 

[Figure: where encryption can be applied. Application-layer encryption at the application layers (5-7), network-layer encryption at the transport/network layers (3-4), and link-layer encryption at the link/physical layers (1-2).]


Network-layer Encryption 


Joe’s PC to HR Server 



• Traffic encrypted on a flow-by-flow basis 
between specific hosts or subnets 

• Media and Interface Independent 

• Transparent to intermediate network devices

• Topology Independent


Keeping Track of the Keys 


Each device has three keys: 

1. A private key that is kept secret and never shared. Used to
sign messages

2. A public key that is shared. Used by others to verify a
signature

3. A shared secret key that is used to encrypt data using a
symmetric encryption algorithm (e.g., DES)





Data Confidentiality 




Bob's Public Key 


Bob’s Private Key 


• Alice gets Bob’s public key 

• Alice encrypts message with Bob’s public key 

• Bob decrypts using his private key 


Sender Authentication 




Alice's Private Key 


Alice’s Public Key 


• Alice encrypts message with her private key

• Bob gets Alice’s public key

• Bob decrypts using Alice’s public key
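
The two slides above use the same key pair in opposite directions. A toy textbook-RSA example makes this visible; the tiny primes (61 and 53), the exponent 17 and the helper name apply_key are chosen here purely for illustration and offer no security. A single key pair stands in for Bob's in the first half and for Alice's in the second.

```python
# Textbook RSA with tiny primes. Illustration only: no padding, no security.
P, Q = 61, 53
N = P * Q                            # public modulus
E = 17                               # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))    # private exponent (needs Python 3.8+)


def apply_key(value: int, exponent: int) -> int:
    """RSA is the same modular exponentiation for either key."""
    return pow(value, exponent, N)


message = 65

# Data confidentiality: encrypt with the recipient's PUBLIC key,
# decrypt with the recipient's PRIVATE key.
ciphertext = apply_key(message, E)
recovered = apply_key(ciphertext, D)

# Sender authentication: transform with the sender's PRIVATE key,
# anyone can undo it with the sender's PUBLIC key.
signed = apply_key(message, D)
checked = apply_key(signed, E)

if __name__ == "__main__":
    print(recovered == message, checked == message)
```

Only the holder of the private exponent can produce `signed`, which is why recovering the message with the public key demonstrates who sent it.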



Signature Generation 


• Hash of message: a one-way function. Easy to produce hash from message, “impossible” to produce message from hash

• Sign hash with private key

• Signature = “encrypted” hash of message


Signature V erificatio n 


1. Re-hash the received message (the message arrives with an appended signature)

2. Decrypt the received signature using Alice's public key to recover the hash of the message

3. If the hashes are equal, the signature is authentic
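
Signature generation and verification as outlined on these two slides can be sketched as hash-then-sign. The key below is the same kind of toy RSA pair (modulus 3233) used only for illustration, and reducing the SHA-256 digest modulo such a small number is an assumption made so the arithmetic fits; real signatures use large keys and a padding scheme.

```python
import hashlib

# Toy RSA key pair (p = 61, q = 53). Illustration only, not secure.
N, E, D = 3233, 17, 2753


def digest(message: bytes) -> int:
    """One-way hash of the message, reduced to fit the toy modulus."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message: bytes) -> int:
    """Signature = hash of the message transformed with the private key."""
    return pow(digest(message), D, N)


def verify(message: bytes, signature: int) -> bool:
    """Undo the signature with the public key and compare with a fresh hash."""
    return pow(signature, E, N) == digest(message)


if __name__ == "__main__":
    msg = b"Deposit $1000"
    sig = sign(msg)
    print(verify(msg, sig))
```

Signing the short hash rather than the whole message keeps the expensive private-key operation constant in size regardless of message length.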






What IPSec Provides

• Authentication of IP datagram 

• Confidentiality of IP datagram 

• Anti-replay protection 

• Secure tunneling 


Overall Architecture (RFC 1825) 

• Framework for security protocols
to provide:

  - Data integrity

  - Data authentication

  - Data confidentiality

  - Security association management

  - Key management




Authentication Header (RFC 1826) 


All Data In Clear Text 


• Data integrity—no twiddling of bits 

• Origin authentication—definitely 
came from Router 

• Uses keyed-hash mechanism 

• Does NOT provide confidentiality 
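
The keyed-hash idea behind the Authentication Header can be illustrated with Python's standard hmac module. This is a deliberately simplified sketch: the real header covers selected IP header fields as well and has its own wire format, and the functions protect and check are names invented for this example.

```python
import hashlib
import hmac

TAG_LENGTH = hashlib.sha1().digest_size   # 20 bytes for HMAC-SHA1


def protect(key: bytes, payload: bytes) -> bytes:
    """Append a keyed hash to the payload. The payload stays in clear text."""
    tag = hmac.new(key, payload, hashlib.sha1).digest()
    return payload + tag


def check(key: bytes, packet: bytes) -> bytes:
    """Verify the keyed hash and return the payload, or raise on tampering."""
    payload, tag = packet[:-TAG_LENGTH], packet[-TAG_LENGTH:]
    expected = hmac.new(key, payload, hashlib.sha1).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("integrity check failed")
    return payload


if __name__ == "__main__":
    key = b"shared secret between the two routers"
    packet = protect(key, b"Deposit $1000")
    print(check(key, packet))
```

Because the payload travels unmodified, anyone on the path can read it; only someone holding the shared key can alter it without detection.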






Encapsulating Security 
Payload (RFC 1827) 


All Data Encrypted


• Confidentiality 

• Data origin authentication 

• Data integrity 

• Replay protection (optional) 


IPSec Encapsulating Security 
Payload Header (ESP) 

• ESP header is 
prepended to 
IP datagram 

• Confidentiality 
through encryption 
of IP datagram 

• Integrity through
keyed hash 
function 



Security Association (SA) 


Insecure Channel 


• Agreement between two entities
on method to communicate securely

• Unidirectional: two-way
communication consists of two SAs


Security Associations Enable 
Your Chosen Policy 


• Tunnel-Mode, AH-HMAC-SHA, PFS 50

• Transport-Mode, ESP-DES-HMAC-MD5, PFS 15





IPSec Security Association (SA) 


Destination Address 
Security Parameter Index (SPI) 



Additional SA Attributes 
(e.g., lifetime) 
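
One way to picture a security association is as a record looked up by destination address and SPI. The sketch below is an assumption-laden illustration: the field names, the SADatabase class and the example addresses are invented here, and a real implementation also keys on the security protocol and stores keys and sequence counters.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class SecurityAssociation:
    destination: str        # destination address
    spi: int                # Security Parameter Index
    mode: str               # "tunnel" or "transport"
    transform: str          # e.g. "ESP-DES-HMAC-MD5"
    lifetime_seconds: int   # an example of an additional SA attribute


class SADatabase:
    """Holds SAs keyed by (destination address, SPI)."""

    def __init__(self) -> None:
        self._table: Dict[Tuple[str, int], SecurityAssociation] = {}

    def add(self, sa: SecurityAssociation) -> None:
        self._table[(sa.destination, sa.spi)] = sa

    def lookup(self, destination: str, spi: int) -> Optional[SecurityAssociation]:
        return self._table.get((destination, spi))


db = SADatabase()
# SAs are unidirectional, so two-way traffic needs one SA per direction.
db.add(SecurityAssociation("192.0.2.2", 0x1001, "transport", "ESP-DES-HMAC-MD5", 3600))
db.add(SecurityAssociation("192.0.2.1", 0x2002, "transport", "ESP-DES-HMAC-MD5", 3600))

if __name__ == "__main__":
    print(db.lookup("192.0.2.2", 0x1001))
```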





IPSec Processing: Input and Output



ISAKMP + Oakley

• Negotiates policy to protect 
communication 

• Authenticated Diffie-Hellman Key 
Exchange 

• Negotiates (possibly multiple) 
security associations for IPSec 
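
The Diffie-Hellman exchange at the heart of this negotiation can be shown in a few lines. The sketch below is unauthenticated and uses parameters picked for brevity (the Mersenne prime 2^127 - 1 and generator 3 are assumptions for illustration, not a standardised Oakley group); the authentication methods ISAKMP+Oakley layers on top are what protect a real exchange from an attacker in the middle.

```python
import secrets

# Illustrative public parameters, not a standardised group.
PRIME = 2**127 - 1
GENERATOR = 3


def make_keypair():
    """Pick a random private exponent and compute the public value."""
    private = secrets.randbelow(PRIME - 3) + 2
    public = pow(GENERATOR, private, PRIME)
    return private, public


def shared_secret(own_private: int, peer_public: int) -> int:
    """Combine our private exponent with the peer's public value."""
    return pow(peer_public, own_private, PRIME)


alice_private, alice_public = make_keypair()
bob_private, bob_public = make_keypair()

# Only the public values cross the insecure channel.
alice_view = shared_secret(alice_private, bob_public)
bob_view = shared_secret(bob_private, alice_public)

if __name__ == "__main__":
    print(alice_view == bob_view)
```

Both sides arrive at the same value without it ever being transmitted, and that value can then seed the session keys used by the negotiated security associations.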






ISAKMP+Oakley Authentication


• Signatures (RSA or DSS)

  - Diffie-Hellman secret, identity, hashed together and signed
  - Non-repudiable proof of communication

• Encrypted nonces (RSA only)

  - Pseudo-random nonce encrypted in other party’s public key
  - Nonces, Diffie-Hellman secret, identities hashed
  - Repudiable, deniable exchange

• Pre-shared key

  - Key is agreed-upon out-of-band
  - Key, Diffie-Hellman secret, identities hashed
  - Limited applicability



How IPSec Uses ISAKMP+Oakley


[Figure: Alice's laptop and Bob's laptop, each running ISAKMP]

1. Outbound packet from Alice to Bob. No SA

2. Alice’s ISAKMP begins negotiation with Bob’s

3. Negotiation complete. Alice and Bob now have complete SAs in place

4. Packet is sent from Alice to Bob protected by IPsec SA




Industry Update

• Automotive Industry (ANX) pushing development of IPSec

• Interoperability achieved at bake-offs

• Compliant releases in early 1998, including Cisco and Microsoft NT 5.0


Cisco Update

• IOS IPSec in early field trials, shipping in Q1-98

• PIX IPSec shipping in Q1-98

• Cisco certificate authority shipping with IPSec release

• Interoperability with Entrust server



Reference Material

• Applied Cryptography [2nd Edition], 
Bruce Schneier, Addison-Wesley 

• Draft RFCs and RFCs 

RFC 1825-29, draft-ietf-ipsec-... 

• Internet Security for Business 
Bernstein, Bhimani, Schultz, Siegel 










A SESAME Linux Environment

Bradley Broom Paul Ashley 


Information Security Research Center 
School of Data Communications 
Queensland University of Technology 
GPO Box 2434, Brisbane - AUSTRALIA 
broom,ashley@fit.qut.edu.au 

Abstract 

SESAME is a new security architecture compatible with Kerberos but with a number of
additional features. These include authentication (user and entity) using public key technology,
an access control service based on role based access control (RBAC), delegation of
privileges, multi-domain support again using a public key model, and an extensive auditing
service. To investigate the usefulness of SESAME for providing a solution for network security,
we have endeavoured to build a SESAME Linux environment. This paper outlines the
results of our efforts. During the process we built a sesamized version of the XDM program,
and also secured the common Unix applications telnet, rTools, RPC and NFS. A description
of the sesamized applications is given as well as performance results for each application.


1 Introduction 

Over the last decade, many organisations have shifted their computing facilities from central
mainframes accessed from simple terminal lines, to servers accessed from personal computers
via a local area network (LAN). Although the switch to LANs solved some problems, it also 
introduced some new problems, not the least of which are significant security issues related to 
user authentication, access control, auditing and distribution of cryptographic keys. 

SESAME [11, 12] (Secure European System for Applications in a Multi-Vendor Environment)
is a new security architecture designed to solve these network security problems. To accomplish
this difficult task, SESAME provides an impressive range of security services including a single
sign-on service, authentication using Kerberos or public key techniques, an RBAC based access
control service, delegation of privileges, data protection (integrity and confidentiality) in transit,
an extensive auditing service, multi-domain support and an implementation of the GSS-API [15].
We were interested in investigating SESAME’s usefulness for a number of reasons: 

• Existing proposals for using cryptographic security mechanisms for providing network security
are ad hoc in their approach. For example, it is not uncommon in a network to find
Secure Sockets Layer (SSL) [6] secured telnet, ftp, web server and browser, Secure Shell
Protocol (SSH) [21] for a secure replacement of the rtools, secure Network File System
(NFS) [5] based on Kerberos [14] and Pretty Good Privacy (PGP) [10] for email. We believe
an ad hoc approach results in lack of interoperability, duplication of security services,
excessive administration and maintenance, and a greater potential for vulnerabilities. Because
SESAME provides such a wide array of services, it has the potential to provide a
uniform solution for network security.


A SESAME Linux Environment 





• Determine to what extent the access control service (RBAC) [18] of SESAME is useful for 
applications and also for file system operations. The aim of RBAC is to simplify security 
administration and we wanted to investigate this feature. 

• Determine the practicality of SESAME in terms of performance and also usability.

With the aim of testing SESAME, we have built a sesamized Linux environment. Part of 
the project involved porting SESAME to Linux [1] so that we had SESAME in an environment 
where application, library and kernel sources were freely available. 

This paper outlines the results of our efforts in building a SESAME Linux environment. 
Section 2 contains a description of the SESAME components for a SESAME Security Domain. 
Sections 3 to 7 give a description of the sesamized XDM, telnet, rTools, RPC and NFS. This 
includes a description of how RBAC was incorporated into NFS. Section 8 describes improving 
SESAME user authentication using Smart Cards. Section 9 describes our future work. We finish 
with our conclusions. 

2 The SESAME Security Domain 



Figure 1: A SESAME Secured Domain 

Figure 1 shows the components of a SESAME security domain. Each domain has a SESAME 
Domain Security Server. This security server contains the Authentication Server (AS) (for 
authenticating users), Privilege Attribute Server (PAS) (for providing users with credentials), 
and the Key Distribution Server (KDS) (for generating session keys used by the applications). 
The security server also has an Audit Server for collecting audit information from across the 
network. 

SESAME defines SESAME Application Servers that run the server components of applications.
Application Servers must each have their own SESAME PAC Validation Facility
(PVF), a SESAME component for checking the validity of user credentials. The credentials are
in the form of a Privilege Attribute Certificate (PAC) and the PVF checks the integrity of this
certificate.

SESAME also defines SESAME Application Clients that run the client components of applications.
In our SESAME system, Application Clients run the sesamized XDM application that
allows them to authenticate themselves to the security server, and be provided with a PAC.

The figure also shows that Application Servers can also act as Application Clients, allowing 
users access to client components, whilst the computer is also running server components. 

Finally note that SESAME provides a Certificate Authority (that should not be connected 
to the network), for signing of public key certificates. 










3 XDM 


[Figure: a Client running XDM (prompts: login, password, role) and an Application, and the Domain Security Server; arrows labelled 1. login and 2. PAC]

Figure 2: A SESAME XDM

XDM is the Linux X display manager. XDM provides users with the ability to log in to the
local computer (or a remote computer), and establish an X session with that computer.

The Linux XDM has been sesamized. As shown in Figure 2, XDM is run on a SESAME
Application Client. The user provides a username and password, as with traditional XDM,
and optionally the name of a role that the user would like to log in with. XDM has
been modified so that the user is actually logging into the SESAME Domain Security Server, and
is authenticated by SESAME.

If the user has an account on the Application Client, and SESAME authentication is successful,
then the user is logged into the local machine. If the user doesn’t have an account, then
XDM creates a temporary account for the user. If a rolename is specified (the user wants to
use sesamized applications), then SESAME will provide the XDM program with a PAC for the
user, and this PAC is stored for the user in the user’s home directory. This PAC is then used
by application client programs to authenticate the user to remote servers. The PAC is deleted
when the user logs out.
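
The login decisions described in this section can be summarised in a short sketch. It is not the code of the sesamized XDM: the function names, the callables standing in for SESAME authentication and PAC retrieval, and the dictionary standing in for per-user PAC storage in the home directory are all hypothetical.

```python
def xdm_login(username, password, role, authenticate, request_pac,
              local_accounts, pac_store):
    """Return a session description, or None if SESAME authentication fails."""
    if not authenticate(username, password):
        return None
    # No local account: XDM creates a temporary one for this session.
    temporary = username not in local_accounts
    if role is not None:
        # A role was given, so a PAC is obtained and kept for client programs.
        pac_store[username] = request_pac(username, role)
    return {"user": username, "temporary_account": temporary,
            "has_pac": username in pac_store}


def xdm_logout(username, pac_store):
    """Delete the user's PAC when the session ends."""
    pac_store.pop(username, None)
```

A login without a role still succeeds but leaves the user without a PAC, which matches the text: the PAC is only needed for sesamized applications.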

4 Telnet 

Telnet [16] is a client-server based program that is mainly used to get a remote shell across the 
network. Telnet is still one of the most popular applications on the Internet. In most cases it is 
used without any security considerations at all and remains one of the most vulnerable programs: 
users’ password information is sent across the network in the clear, and telnet provides no data 
protection services for the data in transit. 



Figure 3: SESAME telnet 










Figure 3 shows how Telnet was secured with SESAME and this is indicative of how sesamized
applications operate. The telnet client program (telnet) transports the user’s PAC to the
telnet server program (telnetd) instead of forwarding the user’s password. telnetd acknowledges
whether the user’s PAC is accepted and, for mutual authentication of the server to the client,
sends its own credentials to the client. If the context is successfully established, then the
data is confidentiality protected. Note also that if the RBAC privileges were delegated to
telnetd, then the server would have access to the user’s credentials and could use these for other
applications.

Table 1 gives the performance of Telnet running on a LAN, with 200 MHz Pentiums with 64 MB of RAM.
In the table, round trip time (RTT) is indicated to show that the performance is dependent on the
speed of the authentication handshake protocol, which depends on the speed of the underlying
network.


Security Service          Time
Single Authentication     215 ms + RTT
Mutual Authentication     240 ms + RTT
gss_wrap character        17.8 ns
gss_unwrap character      5.60 µs

Table 1: Performance Results for SESAME Telnet


A more detailed description of securing Telnet with SESAME can be found in [20, 3]. 

5 rTools 

The rTools [13] are a suite of remote host-to-host communication programs. They have been
ported to most Unix based systems due to their popularity and usefulness. One of the major
features of the tools is the ability to access resources (files, programs) on the remote host
without explicitly logging into the remote host. The security is ’host’ based, in that as long
as a connection appears to be coming from a ’trusted’ host it is accepted. The ease of masquerading
as a host by simply changing a computer’s address is widely documented, and this is
the rtools’ biggest weakness and the reason they have been disabled in most security conscious
environments. The rtools also provide no data protection services.

We have focussed on sesamizing three of the rtool programs: rlogin (remote login), rsh (remote
shell), and rcp (remote file copy). The sesamized rtools have the same security advantages
as Telnet.

Table 2 gives the performance of the rtools running on a LAN, with 200 MHz 64MRam 
Pentiums. 

A more detailed description of securing the rTools with SESAME can be found in [20, 3]. 


6 Remote Procedure Call 

Unix provides the RPC, which is a system that allows an application client program to execute 
a procedure on a remote server. The RPC system allows an application programmer to build 
a networked client-server application quickly, as the programmer can focus on the application 
and use the RPC system for the network transport. 

We have focussed on securing Open Network Computing (ONC) RPC [19] because it is 
arguably the most popular version of RPC and is the version available on most Unix systems. 






Program   Security Service          Time
rlogin    Single Authentication     190 ms + RTT
          Mutual Authentication     210 ms + RTT
          gss_wrap character        10.1 µs
          gss_unwrap character      32.1 µs
rsh       Regular ’rsh cat file’    5.11 s
          SESAME ’rsh cat file’     6.41 s + RTT
rcp       Regular ’rcp file’        3.29 s
          SESAME ’rcp file’         5.41 s + RTT

Table 2: Performance Results for SESAME rtools


The current version of ONC RPC provides only limited security: the most popular Unix flavor
provides minimal security, and the DES and Kerberos flavors only provide strong authentication.
We have provided a SESAME security flavor for ONC RPC, similar to work done recently at
Sun Microsystems [8, 7] with Kerberos.

Existing applications that use RPC require minimal modifications to use the sesamized RPC. 
The client only needs to be modified to request the new SESAME flavor, and the server only 
needs to be modified to use the client’s RBAC privileges. We have made the privileges available 
to the server in a convenient structure that requires minimal modification to server code. 

The following security services are available to SESAME RPC: SESAME single or mutual
authentication of client and server; a choice of no protection, integrity only, or integrity and
confidentiality data protection; and RBAC access control, which allows the server to acquire the
client’s role. Because RPC is an infrastructure to build networked applications, we have thus provided
a SESAME secured infrastructure for building secured networked applications.

Table 3 gives the performance of ONC RPC running on a LAN, with 200 MHz 64MRam 
Pentiums. 


Flavor    Security Service          Time
Unix      Single Authentication     0.32 ms
          Procedure Call            0.35 ms
SESAME    Single Authentication     360 ms + RTT
          Mutual Authentication     483 ms + RTT
          Call                      0.42 ms
          Call (Int.)               2.4 ms
          Call (Int.+Conf.)         2.5 ms

Table 3: Performance Results for SESAME ONC RPC
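
As a worked example of reading Table 3, the sketch below estimates the time for a session of SESAME RPC calls. It assumes, and this is our reading rather than something the measurements state, that authentication is paid once per session and that each call then costs the per-call figure for the chosen protection level. The function and dictionary names are illustrative only.

```python
# Figures taken from Table 3 (milliseconds).
SESAME_AUTH_MS = {"single": 360.0, "mutual": 483.0}
SESAME_CALL_MS = {
    "none": 0.42,                       # Call
    "integrity": 2.4,                   # Call (Int.)
    "integrity+confidentiality": 2.5,   # Call (Int.+Conf.)
}


def session_time_ms(n_calls, protection, auth="mutual", rtt_ms=0.0):
    """Estimated time: one authentication (plus RTT) followed by n_calls calls."""
    return SESAME_AUTH_MS[auth] + rtt_ms + n_calls * SESAME_CALL_MS[protection]


def calls_until_calls_dominate(protection, auth="mutual"):
    """Number of calls after which per-call cost exceeds the one-off authentication cost."""
    return SESAME_AUTH_MS[auth] / SESAME_CALL_MS[protection]
```

Under these assumptions a session of 1000 fully protected calls with a 1 ms round trip takes about 483 + 1 + 1000 × 2.5 = 2984 ms, so the authentication handshake is a fixed cost that matters mainly for short sessions.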


A more detailed description of securing ONC RPC with SESAME can be found in [20, 3].

7 RBAC NFS 

The NFS from Sun Microsystems is a system for sharing files across networks. NFS is described 
as transparent, because users access local files and remote files in exactly the same way. 

Lately because of the low cost of networking, and because NFS is an open standard, the use of 
NFS has increased considerably. Unfortunately NFS to date has been regarded as insufficiently 







secure, so much so that most organizations are unwilling to use it except across internal closed
networked environments.

Our sesamized version of NFS has the following advantages: 

• The mount protocol has been secured so that client and server are mutually authenticated,
and the request and returned file handle are data protected.

• Access to files has been secured. Users and the NFS Server are mutually authenticated, 
and all requests and replies are data protected. 

• The filesystem supports RBAC. The NFS Server has been modified to give access using 
the user’s role, rather than traditional UID/GID. 

To support RBAC, the NFS Server has been modified in the following ways: 

• The role in the user’s PAC is translated to a local GID or GIDs (via a cached lookup of 
/etc/group). 

• RBAC is implemented using normal Unix UID/GID semantics, except when creating files
or directories. In each directory, a .rbac file, if it exists and is valid, is used to control file
creation. To be valid the .rbac file must be owned by the resource manager concerned,
and must not imply privileges not possessed by the resource manager.

• New files and directories are owned by the resource manager concerned and the permissions 
are set according to the contents of the .rbac file. For example: nurse (rw), others (r). 
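
To make the role-to-group mapping concrete, the following sketch turns the example rule above into a Unix group and mode for a newly created file. The on-disk syntax of the .rbac file is not given here, so the parser simply accepts the form used in the example; the owner permissions, the role-to-GID table (standing in for the cached /etc/group lookup) and all function names are assumptions.

```python
import re

PERM_BITS = {"r": 4, "w": 2, "x": 1}


def parse_rbac(text):
    """Parse 'nurse (rw), others (r)' into {'nurse': 'rw', 'others': 'r'}."""
    return dict(re.findall(r"(\w+)\s*\(([rwx]*)\)", text))


def perm_value(perms):
    """Convert a string such as 'rw' into its octal digit (6)."""
    return sum(PERM_BITS[c] for c in set(perms))


def creation_attributes(rules, role_to_gid, owner_perms="rw"):
    """Return (gid, mode) for a new file governed by the parsed rules.

    The first entry other than 'others' is treated as the role whose GID
    becomes the file's group; the owner is the resource manager.
    """
    role = next(name for name in rules if name != "others")
    mode = ((perm_value(owner_perms) << 6)
            | (perm_value(rules[role]) << 3)
            | perm_value(rules.get("others", "")))
    return role_to_gid[role], mode
```

With the example rule and a hypothetical GID of 501 for the nurse role, the new file would get group 501 and mode 0664, which ordinary Unix permission checks can then enforce.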

On the first NFS request per user, the PAC is transferred and is acknowledged. Subsequent
NFS requests, other than file or directory creation, require only a cached UID/GID lookup.
Because file and directory creation requests together are less than 1% of all NFS requests [2],
the performance is satisfactory.

Table 4 gives the performance of NFS running on a LAN, with 200 MHz Pentiums with 64 MB of RAM.


Function         Security Service                            Time
Mount Protocol   Mutual Authentication                       RPC Overhead
User Access      Mutual Authentication (First time only)     RPC Overhead
                 Subsequent Access Overhead (NFS Server)     8 µs
                 Create File (no .rbac)                      1.2 ms
                 Create File (.rbac exists)                  7.3 ms
                 Create Directory (no .rbac)                 2.7 ms
                 Create Directory (.rbac exists)             9.9 ms

Table 4: Performance Results for SESAME NFS


8 Strengthening SESAME User Authentication with Smart Cards 

SESAME provides two mechanisms for user authentication to the domain security server: 

• Kerberos user authentication: this is the authentication used when username and password 
are provided (such as through the XDM program). 














• Public key authentication: in this case a user’s private key is stored in the user’s file system 
and used for authentication. 

Bellovin and Merritt [4] describe the limitations of the Kerberos user authentication. The 
important limitations are: 

• Host based attacks: Although the user is not sending the password in the clear across the 
network, the password is in the clear on the client machine and is hence vulnerable to a 
host based attack. 

• Dictionary attacks: Because the key space of passwords is small, an off-line attack can find 
passwords very quickly. 

• Compromise of the AS is catastrophic: Because the AS holds users’ secret keys, if the AS
is ever compromised then an attacker could impersonate anyone.

We have aimed to remove the first two limitations (the third limitation cannot be resolved 
as it is a limitation of symmetric key authentication rather than just Kerberos). To reduce the 
limitations we are building a smart card system that can hold the user’s secret key (this time it 
can be a full DES key so the dictionary attack is not possible), and move the processing to the 
card (so that a host based attack is not possible). 
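
The difference in key space behind the dictionary attack argument can be made concrete with a little arithmetic. The password model below (eight characters drawn from the 26 lowercase letters) is an assumption for illustration only; a DES key has 56 effective bits.

```python
import math


def keyspace_bits(alphabet_size, length):
    """Bits of entropy for a uniformly random string over the given alphabet."""
    return length * math.log2(alphabet_size)


# Assumed password model: 8 characters, lowercase letters only.
password_bits = keyspace_bits(26, 8)

# A full DES key held on the smart card.
des_bits = 56

# How many times larger the DES key space is than the assumed password space.
ratio = 2 ** (des_bits - password_bits)
```

Under this model the password space is a little under 2^38, so a random DES key is several hundred thousand times harder to search, and real passwords chosen by users are usually far less random than the model assumes.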

The SESAME public key authentication has two limitations: 

• Host based attack: The user’s private key is stored on the file system so is vulnerable to 
a host based attack. 

• Impersonation of the certificate authority: It is unclear where the user is to store the CA’s 
public key. The integrity of the CA’s public key is vital for certificate verification. 

Both of the limitations can be removed if a Smart Card system is used. If the Smart Card 
stores the user’s private key, the CA’s public key, and all processing is done on the card, then 
the limitations are eliminated. 

9 Future Work 

Our work on building a SESAME Linux environment will continue in the future. We aim to 
perform the following application work: 

• sesamize the file transfer protocol (ftp) [17]. 

• sesamize a web server and browser. 

• sesamize an email application. 

We would also like to improve SESAME itself: 

• Provide a finer degree of delegation control, possibly using SDSI certificates [9]. 

• Improve the GSS-API implementation. Currently SESAME supports only GSS-API Version 1.
GSS-API Version 2 [15] is now available and has some important extra routines.
Non-context based GSS-API proposals should also be examined for implementation with
SESAME.





10 Conclusion 


With its wide array of security services, SESAME appears to be a good solution to providing 
a uniform approach to network security. Its performance is adequate on modern hardware and 
its access control service based on RBAC seems suitable for both applications and at the file 
system level. 


References 

[1] P. Ashley and B. Broom. Implementation of the SESAME Security Architecture for Linux. 
In Proceedings of the AUUG Summer Technical Conference , Brisbane, Qld., April 1997. 

[2] P. Ashley and B. Broom. Role Based Access Control for NFS Using Unix Groups. In 
(Submitted to) 3rd ACM Workshop on Role Based Access Control , 1998. 

[3] P. Ashley, M. Vandenwauver, and B. Broom. A Uniform Method for Securing Unix Applications
Using SESAME. In Third Australian Conference on Information Security and
Privacy, Brisbane, Qld., July 15-17 1998.

[4] S. Bellovin and M. Merritt. Limitations of the Kerberos Authentication System. In Proceedings
of the USENIX Winter ’91 Conference, pages 253-267, Dallas, Tx., 1991.

[5] B. Callaghan, B. Pawlowski, and P. Staubach. NFS Version 3 Protocol Specification, 1995. 
RFC1813. 

[6] T. Dierks and C. Allen. The TLS Protocol Version 1.0, May 1997. Internet Draft IETF 
Transport Layer Security WG. 

[7] M. Eisler, A. Chiu, and L. Ling. RPCSEC-GSS Protocol Specification, 1997. RFC2203. 

[8] M. Eisler, R. Schemers, and R. Srinivasan. Security Mechanism Independence in ONC 
RPC. In Proceedings of the 6th USENIX Security Symposium , San Jose, CA., July 1996. 

[9] C. Ellison, B. Frantz, R. Rivest, and B. Thomas. Simple Public Key Certificate, April 1997.
Internet Draft IETF Simple Public Key Infrastructure WG.

[10] S. Garfinkel. PGP: Pretty Good Privacy. O’Reilly & Associates, Inc., January 1995.

[11] P. Kaijser. SESAME The European Solution to Security For Open Systems. In Proceedings 
of the 10th World Conference on Computer Security, Audit and Control COMPSEC, pages 
289-297, London, UK, October 1993. 

[12] P. Kaijser, T. Parker, and D. Pinkas. SESAME: The Solution To Security for Open Distributed
Systems. Computer Communications, 17(7):501-518, July 1994.

[13] B. Kantor. BSD rlogin, September 1991. RFC1258. 

[14] J. Kohl and C. Neuman. The Kerberos Network Authentication Service V5, 1993. RFC1510. 

[15] J. Linn. Generic Security Service Application Program Interface Version 2, 1997. RFC2078. 

[16] J. Postel and J. Reynolds. Telnet Protocol Specifications, 1983. RFC854. 

[17] J. Postel and J. Reynolds. FTP : File Transfer Protocol, 1985. RFC959. 





[18] R. Sandhu, E.J. Coyne, H.L. Feinstein, and C.E. Youman. Role-Based Access Control 
Models. IEEE Computer , pages 38-47, February 1996. 

[19] R. Srinivasan. Remote Procedure Call Protocol Specification Version 2, 1995. RFC1831. 

[20] M. Vandenwauver, P. Ashley, M. Rutherford, and S. Boving. Using SESAME’s GSS-API to
Add Security to Unix Applications. In IEEE Third International Workshop on Enterprise
Security, September 1998.

[21] T. Ylonen, T. Kivinen, and M. Saarinen. SSH Protocol Architecture, November 1997. 
Internet Draft, IETF Secure Shell Working Group. 





Intranet Security technologies - Sesame or SSL? 


Paul Ashley+ Gary Gaskell+

+ Information Security Research Centre
Queensland University of Technology
GPO Box 2434 Brisbane 4001 Australia
http://www.isrc.qut.edu.au
{ashley,gaskell}@isrc.qut.edu.au

Joris Claessens* Mark Vandenwauver*

* Computer Security and Industrial Cryptography
Katholieke Universiteit Leuven
ESAT/COSIC Kardinaal Mercierlaan 94 3001 Heverlee Belgium
http://www.esat.kuleuven.ac.be/cosic
{Mark.vandenwauver,joris.claessens}@esat.kuleuven.ac.be


Abstract 

In recent years there has been a phenomenal increase in the use of networking. The technologies
of the Internet are becoming the standard for organisational information retrieval.
Due to their success it has become possible to create a virtual office environment for an
organisation, regardless of the geographical diversity of the personnel. As such the security
of an organisation’s information is in doubt unless steps are taken to protect the information
assets.

Cryptographic solutions are essential to protect computer assets where the solutions are
deployed over untrusted networks. With the value of information on an organisation’s network
and the complexity of the IT infrastructure it is often inadvisable to place a high degree
of trust even on internal networks. The internal networks are likely to have components that
are interconnected using infrastructure that is outside of the control of the owner of the
information. Hence the use of cryptographic techniques is required to enhance the security
of an organisation’s information as it travels and is accessed over untrusted networks.

Generically speaking the threats to the information resources include loss of integrity of the
information and loss of confidentiality (access by unauthorised persons). This paper compares
SSL and SESAME by discussing: what security services they provide, what cryptographic
technology is used by each solution, the applications supported, where they fit in the TCP/IP
model, availability and finally a description of their limitations. The objective is to compare
these security technologies with a view to using them for Intranet security solutions.

This paper concludes that SSL is a strong security solution for some Internet applications, 
however essential services such as authorisation are missing from SSL and an application level 
architecture such as SESAME is highly desirable. It also concludes that SESAME provides 
a level of services well above that of SSL and thereby greatly assists an organisation in the 
efficient management of its IT security. 


1 Introduction 

An Intranet is becoming a core component of most organisations’ infrastructure. The Internet
technologies used in an Intranet are seen as an effective way to exchange information. Future
growth in the use of Intranets and the Internet depends very heavily on how well they are
secured. The problem of securing networks can be approached in numerous ways; the most
popular ones to date are:


Intranet Security technologies - Sesame or SSL? 





• Application level security (for example: corporate applications, telnet, ftp, ...) 

• Networking level security. This might be achieved by adding security at various layers in 
the network hierarchy (for example networking, transport...) 

• Physical security 

There are likely to be many different security needs. It is important to gain an understanding
of what security services and features are required. Some examples may be: authentication of
users and network entities, data protection in transit (integrity, confidentiality and authentication
of sender), non-repudiation of data receipt, access control for users, single sign-on solutions,
key distribution, auditing and the security of system bootstrapping. There are also other
considerations when designing a network solution such as cryptographic policy and key recovery
requirements.

The purpose of this paper is to examine two recent solutions for network security, SESAME
and SSL, and their suitability for Intranet security solutions. The paper aims to highlight the
differences in the two solutions. The paper is structured as follows. Section 2 gives a brief
overview of the SESAME and SSL technologies. Section 3 compares these two solutions.
From this comparison we conclude that SSL may not be suitable for a majority of Intranet
deployments due to its lack of user authorisation services and due to the superior manageability
of security by SESAME.

The paper presents the supporting arguments for this conclusion and finishes with some 
recommendations. 

2 An Overview of the SESAME and SSL Technologies 
2.1 SESAME 





Figure 1: Overview of the SESAME components 













SESAME [10, 11, 15] is the name of a security architecture. As a security architecture it describes
where in a system various security services are required.

It is the result of a collaboration of Bull, ICL and Siemens together with some leading
European research groups. The project was funded by the EC under the auspices of its RACE
program. SESAME is an acronym for “A Secure European System for Applications in a Multi-vendor
Environment”. Figure 1 gives an overview of the SESAME architecture. At first glance
it looks very complex but it is possible to distinguish four boundaries in the architecture: the
client, the domain security server, the (application) server, and the support components. The
complexity corresponds to the level of services provided.

The client system incorporates the User, User Sponsor (US), Authentication Privilege Attribute
(APA) Client, Secure Association Context Manager (SACM) and client application code.
The User Sponsor is the user’s interface to the SESAME system. This allows the user to log on.
The APA is used by the User Sponsor for the communication with the domain security server.
The SACM provides the data protection services (data authentication, data confidentiality,
non-repudiation) for the client-server interaction.

The Domain Security Server is very similar to the Kerberos one [12]. The main difference is
the presence of the Privilege Attribute Server (PAS) in SESAME. This server has been added to
manage the access control mechanism that is implemented by SESAME. Because of the many
advantages of role based access control (RBAC) [16], SESAME has opted to implement it. This scheme
is enforced using Privilege Attribute Certificates (PACs) [6]. The functions of the Authentication
Server (AS) and Key Distribution Server (KDS) (ticket granting server in Kerberos) are similar
to their Kerberos counterparts: providing a single sign-on and managing the cryptographic keys.
A major difference from Kerberos is that SESAME also supports public-key based authentication
using the X.509 authentication mechanism [9].

When the application server receives a message from an application client indicating that it 
wants to set up a secure connection, it forwards the client’s credentials and keying material (an 
encrypted session key) to the PAC Validation Facility (PVF), which checks whether the client 
has access to the application. If this check is successful, it decrypts the keying material and 
forwards the session keys (SESAME uses independent keys for providing data authentication 
and data confidentiality) to the SACM on the server machine. Through this, the application 
server authenticates to the client (mutual authentication) and it also enables the application 
server to secure the communication with the client. 

The SESAME architecture provides a number of support components used throughout the
system. These include the Audit facility (providing detailed audit logs), Cryptographic Support
Facility (CSF) (providing the various cryptographic primitives), and the Public Key Management
(PKM) facility.

2.2 SSL 

The dominant factor in the growth of the Internet in recent years has been the success
of the World Wide Web (WWW). The Secure Sockets Layer (SSL) solution [7] from Netscape
Communications is an attempt to provide security for the WWW. It is also used to secure other
applications such as telnet and ftp.

The original version of SSL was SSL 2.0 [8]. This version contained a number of security
flaws. These flaws were fixed in the SSL 3.0 specification [7]. Both versions 2.0 and 3.0 are
currently used and are a de facto standard. In 1997 the IETF formed a Transport Layer Security
(TLS) working group which has adopted the SSL 3.0 protocol, with some minor changes, in the
initial release of its TLS protocol [5]. This protocol is intended as a generic solution to
secure the transport layer (see Figure 3) and would be independent of the actual application.





In the remainder of this paper, SSL is used to denote the SSL/TLS standard. 

SSL is situated underneath the application layer. It provides entity authentication, data 
authentication and data confidentiality, where the entity authenticated is the workstation. The 
structure of the SSL protocol is outlined in Figure 2. It consists of two levels: the Record Layer 
Protocol and four other protocols of which the most important is the Handshake Protocol. 

At the beginning of a session, the client and server perform a handshake, during which both
entities negotiate the algorithms and cryptographic keys that are going to be used. Both
entities (i.e. the client and server machines) are also optionally authenticated. Strictly speaking
it is the machines that are authenticated and not the users. The SSL specification does not
define ‘client authentication’. What exactly client means is open to interpretation, however at
the sockets layer we take this to be the machine. The new security parameters are used after
the ‘change_cipher_spec’ message at the end of the handshake.

All data is sent through the Record Layer which provides three functions: fragmentation,
compression and cryptographic protection. No compression algorithms are defined yet.
Cryptographic protection means including a MAC for data authentication and using encryption for
data confidentiality.
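
Two of the Record Layer functions, fragmentation and the MAC, can be illustrated with a short sketch. It is not the SSL record format: HMAC-SHA256 stands in for the MAC construction that SSL defines, the record is a plain tuple, compression and encryption are left out, and the function names are our own. The fragment limit of 2^14 bytes is the one used by the SSL 3.0 record layer.

```python
import hashlib
import hmac

MAX_FRAGMENT = 2 ** 14  # largest plaintext fragment carried in one record


def _mac(mac_key: bytes, seq: int, fragment: bytes) -> bytes:
    # The sequence number is covered so records cannot be reordered undetected.
    return hmac.new(mac_key, seq.to_bytes(8, "big") + fragment, hashlib.sha256).digest()


def protect(data: bytes, mac_key: bytes):
    """Split data into fragments and attach a MAC to each one."""
    records = []
    for seq, start in enumerate(range(0, len(data), MAX_FRAGMENT)):
        fragment = data[start:start + MAX_FRAGMENT]
        records.append((seq, fragment, _mac(mac_key, seq, fragment)))
    return records


def verify_and_reassemble(records, mac_key: bytes) -> bytes:
    """Check every MAC and return the original data, or raise ValueError."""
    out = bytearray()
    for seq, fragment, mac in records:
        if not hmac.compare_digest(mac, _mac(mac_key, seq, fragment)):
            raise ValueError("bad MAC in record %d" % seq)
        out += fragment
    return bytes(out)
```

In the real protocol each fragment would also be encrypted after the MAC is computed; the sketch stops at data authentication to keep the example dependency-free.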

Client and server do not always have to perform a complete handshake. It is possible to use 
the same security parameters negotiated in the previous connection. In SSL terminology, this 
means that a new connection can be established within the same session. 



Figure 2: SSL Structure 


3 Comparison 

This section attempts to describe the differences between SESAME and SSL. 

3.1 Positioning in the Networking Model 

Figure 3 shows the OSI and TCP/IP reference models and the relationship between them. The 
OSI model [14] has seven layers and is shown on the left of the figure. The TCP/IP model [4] 
has four layers and is shown on the right. The figure also shows where SESAME and SSL are 
situated in the model. SESAME is an application level solution, whereas SSL is a transport 
layer solution (or more specifically a socket layer solution). This socket layer has recently been 
proposed as a layer that could be added to the TCP/IP model between the transport layer and 
application layer. 

    OSI model             TCP/IP model
    7  Application   \
    6  Presentation   }   Application        <- SESAME
    5  Session       /
                          Sockets Layer      <- SSL
    4  Transport          UDP/TCP
    3  Network            IP
    2  Datalink      \    Drivers and
    1  Physical      /    Hardware

Figure 3: The Positioning of SESAME and SSL in the TCP/IP Model 

Placing the security solutions at different layers in the model causes a number of differences: 

• How the applications interact with the security services; 

• What security services are available to applications. 

Applications that use security services at the application layer have to be modified to use 
these services. In the case of SESAME, applications must be modified to make calls to the 
SESAME security libraries, often using the Generic Security Services API (GSS-API). This 
means that the application code not only performs the function of the application, but must 
also be aware of the security services it would like to use. This also has advantages, as the 
application and/or PAC Validation Facility (PVF) is aware of the user that wishes to take an 
action. Often only the application layer knows whether this action is valid or otherwise. 
Providing security services at the socket layer means that applications (at least in theory) do 
not have to be as security aware (in some cases not at all). In the case of SSL, applications that 
are built using the normal socket library can be SSL secured by simply rebuilding them with a 
new SSL socket library. This means it is possible to turn an insecure application into a secure 
one without changing a line of application code. 
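
The socket layer approach can be illustrated with Python's standard `ssl` module, used here as a modern stand-in for the SSL socket libraries discussed in this paper. In the sketch below the application logic (`send_line`, a hypothetical example function) is written against the socket interface only; securing it changes how the connection is opened, not the application code. Host names and ports are left to the caller.

```python
import socket
import ssl


def send_line(conn, text: str) -> str:
    """Application logic: send one line and return the peer's reply."""
    conn.sendall(text.encode("ascii") + b"\r\n")
    return conn.recv(4096).decode("ascii", "replace")


def open_plain(host: str, port: int):
    """Open an ordinary, unprotected TCP connection."""
    return socket.create_connection((host, port))


def open_secure(host: str, port: int):
    """Open the same TCP connection, wrapped by the SSL/TLS layer."""
    context = ssl.create_default_context()
    raw = socket.create_connection((host, port))
    return context.wrap_socket(raw, server_hostname=host)
```

Either `open_plain` or `open_secure` can supply the `conn` argument of `send_line`. An application level solution would instead require security calls inside `send_line` itself.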

The difference in the security services that can be provided at the different layers is described 
in the next section. In general, the higher a security solution sits in the hierarchy, the more 
additional services it can provide. 

3.2 Security Services Provided 

Table 1 gives an overview of the different security services that are provided by SESAME and 
SSL. The table indicates that the services that SSL provides are in fact a subset of the services 
provided by SESAME. This is mainly due to the fact that SESAME is situated in a higher layer 
of the model. It is however also important to note that SESAME is a whole architecture, while 
SSL is just a standard that defines in what way the communication between two parties should 
be secured. The protocol is restricted to a description of the algorithms to use and the format 
and content of the messages. 

Table 1: Security services 

    Security Service             SESAME    SSL
    Single Sign-on for Users     yes       (no)
    User authentication          yes       (yes/no)
    Workstation authentication   (no)      yes
    Data confidentiality         yes       yes
    Data authentication          yes       (yes)
    Key distribution             yes       yes
    Non-repudiation              yes       (no)
    Auditing                     yes       (no)
    Access control               yes       (no)

The first security service that is only offered by SESAME is a single sign-on for users. That 
is, SESAME allows users of the SESAME secured network to log on once to the network, be 
provided with SESAME credentials, and then to use these to access resources across the network. 
The advantages of a single sign-on solution have been recognised for some time, with Kerberos 
being the most well known example. A single sign-on service is not defined in SSL, as SSL is 
really a peer to peer protocol. However, an application that uses SSL can implement some kind 
of single sign-on mechanism. For example, browsers prompt only once for a password; from the 
moment the user has provided this password, sessions can be established without extra input from 
the user (provided that they are connecting to the same WWW server). The security of such 
techniques is often in question, due to the way browsers protect the secrets that have been 
entrusted to them. 

In both technologies data is protected while in transit with options for both confidentiality 
and integrity protection. Both solutions provide facilities for key distribution, e.g., session key 
establishment. SESAME provides a helpful extension by allowing for a site-configurable option 
of different strength confidentiality and integrity mechanisms. 

Another defining difference depends on your interpretation of the SSL specification. The SSL 
specification discusses ‘client authentication’; however, it does not define what the client is. The 
client is commonly the machine rather than the real user of the machine. The issue is further 
complicated by the position of the user in the application layer and the machine in the sockets 
layer. Some applications have required the user to present a passphrase, before allowing the SSL 
implementation on the machine access to the private key. If the private key is not protected by 
a passphrase (as in some SSL telnet and ftp implementations) then we contend the user is not 
authenticated by the use of the public key cryptography in SSL. In this situation it is only the 
machine that is authenticated. Further, we note that it is common for an application to use SSL 
to secure the channel of a communications session and to use a username and password tuple 
in the application layer to authenticate the user. We admit that some will disagree with this 
position. However, we submit that SESAME presents genuine user authentication, as it is at 
the application layer, and that the use of SSL for user authentication is debatable. 

SSL operates at the sockets layer and is therefore oblivious to the actions of a user. Hence, 
at least from a legal viewpoint, it cannot provide a non-repudiation service. SESAME on the 
other hand provides services directly to the application and hence the non-repudiation of user 
actions is possible. It can be argued that non-repudiation is not a widely used service, at least in 
commercial applications. However this is expected to change on two fronts. One obvious driver 
is electronic commerce. The other may be the auditing and management requirements in large 
world-wide intranets (that are now deployed by multi-national corporations). 

An extensible set of audit tools is provided by SESAME. SSL does not define a method of 
auditing. However, many products implement a separate logging mechanism. For example, 
secure web servers can log specific SSL related interactions. On the other hand, protection of 
these log files is as important as creating them, and this is certainly not always guaranteed. 

One of the main features of SESAME is access control, and in particular Role Based Access 
Control (RBAC). SSL does not implement an access control service. To provide some kind of 
access control to web servers it is possible to use the username/password technique (protected 
by SSL) or the client authentication (using its private key) provided by SSL. 

There is a high administrative overhead with separate passwords to protect the internal 
web pages of an intranet. It is in these situations that the management benefits of RBAC can be 
realised. SESAME provides RBAC access control services and as such we believe its management 
of access control is an order of magnitude better than that of SSL. 
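
The management benefit of RBAC comes from the indirection between users and permissions. The following minimal sketch shows the idea; it is not SESAME's implementation (which carries roles in privilege attribute certificates), and the role names, resources and users are invented for illustration.

```python
# Permissions are attached to roles, never directly to users.
ROLE_PERMISSIONS = {
    "clerk": {("payroll", "read")},
    "manager": {("payroll", "read"), ("payroll", "write")},
}

# Administration consists of assigning users to roles.
USER_ROLES = {"alice": {"manager"}, "bob": {"clerk"}}


def is_allowed(user: str, resource: str, action: str) -> bool:
    """Grant access if any of the user's roles carries the permission."""
    return any(
        (resource, action) in ROLE_PERMISSIONS.get(role, set())
        for role in USER_ROLES.get(user, set())
    )
```

Moving a user to a different job then means changing one entry in `USER_ROLES`, rather than updating a password or access list on every protected server.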

3.3 Algorithms and technology 

Table 2 gives an overview of all the cryptographic primitives and algorithms that are used in 
the different systems. 

In the SESAME column, an ‘x’ denotes that the algorithm is used in the public 
release. The SESAME technology however does not depend on specific algorithms. Due to the 
modular structure, it is normally easy to add or replace algorithms. 

In the SSL column, an ‘x’ means that the algorithm is defined in the standard, in other 
words, that an identifier for the algorithm has been agreed upon. It is however not necessary to 
implement every defined algorithm to have an SSL compliant application. 
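
Because every algorithm combination has an agreed identifier, two SSL implementations that support different subsets can still find common ground. The sketch below uses four cipher suite identifiers from the SSL 3.0 specification; the selection policy (first client preference that the server supports) is one common choice and is an assumption here, as is the function name.

```python
# Cipher suite identifiers as assigned in the SSL 3.0 specification.
SSL_RSA_WITH_RC4_128_MD5 = 0x0004
SSL_RSA_WITH_IDEA_CBC_SHA = 0x0007
SSL_RSA_WITH_DES_CBC_SHA = 0x0009
SSL_RSA_WITH_3DES_EDE_CBC_SHA = 0x000A


def select_suite(client_offer, server_supported):
    """Return the first suite in the client's preference order that the
    server implements, or None if the two sides share no suite."""
    for suite in client_offer:
        if suite in server_supported:
            return suite
    return None
```

A server that implements only RC4 and single DES would, for a client offering triple DES first and RC4 second, settle on `SSL_RSA_WITH_RC4_128_MD5`. If there is no overlap the handshake fails.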

Table 2: Cryptographic algorithms and technology used 

    Primitive                    Algorithm        SESAME   SSL
    Certificates                 X.509v3          x        x
    Key distribution/agreement   Kerberos         x
                                 RSA              x        x
                                 Diffie-Hellman   (x)      x
                                 Fortezza         (x)      x
    Bulk encryption              DES-CBC          x        x
                                 DES-EDE3-CBC     (x)      x
                                 IDEA-CBC         (x)      x
                                 RC2-CBC          (x)      x
                                 CDMF-CBC         (x)      x
                                 RC4              (x)      x
                                 Skipjack         (x)      x
    Hash                         MD5              x        x
                                 SHA-1            (x)      x
    MAC                          HMAC             (x)      x
                                 DES-CBC-MAC      x
                                 RIPEMD-160       (x)
    Digital Signatures           RSA              x        x
                                 DSS              (x)      x
    Protection methods           Timestamps       x
                                 Sequence numbers x        x
                                 Nonces           x        x
    Miscellaneous                Smartcards       x        (x)






Table 3: Implementations and Supported Applications 

    Package        Description                              Where
    SSLeay         freeware library, source code            http://www.ssleay.org
    SSLRef         non-exportable reference library         http://home.netscape.com
    SSLPlus        non-exportable commercial library        http://www.consensus.com
    SSLava         Java library                             http://www.phaos.com
    iSaSiLk        Java library                             http://jcewww.iaik.tu-graz.ac.at
    SSLapps        SSLeay based applications (TELNET,       http://www.ssleay.org
                   FTP, Apache, Mozilla)
    Commercial     mostly export restrictions               Netscape, Microsoft, ...
    SESAME lib.    free to use under license conditions,    [15]
                   full source code
    Applications   TELNET, FTP, RPC, ...                    [2]
    Commercial     security servers, GSS-API                http://www.ism.bull.net,
                                                            http://www.icl.co.uk
    Commercial     Intranet WWW system Siemens              http://www.trustedweb.com


3.4 Applications and Availability 

Table 3 is intended to be an overview of the different available implementations of SSL and 
SESAME. It shows both the libraries and supported applications. It is not an exhaustive list. 

At this stage SSL has been focused on providing security for WWW transactions, although telnet 
and ftp have also been secured by SSL. SESAME has been used to secure Intranet applications like 
the rtools, RPC (Remote Procedure Call), NFS (Network Filesystem) and also telnet and ftp. 
There is also a sesamized WWW application for use in an Intranet. This is available as a 
commercial product. 

SSL is particularly designed for and is suitable for WWW interactions. The security services 
required for this type of transaction are authentication of client and server, session key 
negotiation and data protection in transit. SSL has been designed specifically for this purpose 
and provides more options and better performance than SESAME in this simple configuration. 
The WWW, in its current form, does not require other services such as a finer grained access 
control (although this is considered a weakness by some as fine grained access control is deemed 
essential for future WWW development and is indeed required in the intranet environment). 
SSL is suitable for telnet and ftp for similar reasons as the current model of the WWW. 

SESAME is more suitable for applications where finer grained access control is required and 
this is shown in its ability to be able to secure the Network Filesystem. The implementation 
allows users to be given privileges at login, and the system controls access to files using these 
SESAME privileges. 

SESAME provides security for applications by providing the Internet Standard Generic 
Security Service Application Program Interface (GSS-API) [13]. As previously described, the 
application code has to be modified to call these routines in the appropriate places. Experience 
has shown that the GSS-API is an excellent tool for securing applications [3]. 

It should also be stated that due to its high level of services, SESAME is often integrated 
into an organisation’s custom software. 






3.5 Legal regulations and restrictions 

When discussing implementations of cryptography, it is unfortunately often necessary to mention 
legal and cryptographic policy issues. Some countries (U.S., France, U.K. and Australia) have a 
specific policy regarding import and, especially, export of cryptographically enhanced products. 

The U.S. export restrictions have serious implications on the level of security that can be 
obtained with SSL enhanced products, in particular the most popular WWW browsers and 
servers. Until January 1997, the strongest cryptography that U.S. companies were allowed to 
export was limited to a 40-bit security level. Recently, the export options have been increased 
to a 56-bit level, although certain restrictions still apply. 
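
The practical difference between the two export limits is easiest to see as arithmetic on the key space. The short calculation below is a sketch; the search rate passed to `worst_case_seconds` is a hypothetical parameter, not a measured figure.

```python
def keyspace_ratio(strong_bits: int, weak_bits: int) -> int:
    """How many times larger the stronger key space is."""
    return 2 ** (strong_bits - weak_bits)


def worst_case_seconds(key_bits: int, keys_per_second: float) -> float:
    """Time to try every key at a given (hypothetical) search rate."""
    return 2 ** key_bits / keys_per_second


print(keyspace_ratio(56, 40))  # 65536
```

Moving from 40 to 56 bits therefore multiplies the work of an exhaustive key search by 2^16, whatever hardware the attacker uses.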

Many creative ways of achieving strong cryptography are possible, such as tunneling and proxy 
mechanisms. In October ’97 McKay [?] published a program with which the Netscape browser 
could be patched to a version with strong crypto support, just by changing a few bytes of 
the executable. In April ’98, Netscape released the source code of its Communicator browser. 
Immediately following the release, the Mozilla Crypto Group started integrating the 
SSLeay library into the browser, which gives it full strength cryptography. 

SESAME is also affected by legal regulations. The SESAME project was forced to replace 
the DES algorithm in the SESAME distribution by a simple XOR operation. Therefore anyone 
who installs the SESAME software should replace this by a real DES implementation. Because 
the cryptography is pluggable, this is a fairly easy task and can be done by using a 
separate CSF (see section 2.1). In fact, this has already been done for the Linux version which 
can be downloaded from [2]. 
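
Why the XOR placeholder must be replaced can be shown with a short sketch. The exact form of the XOR operation in the SESAME distribution is not specified here, so a repeating-key XOR is assumed; the function names are hypothetical and are not part of the CSF interface.

```python
from itertools import cycle


def xor_cipher(key: bytes, data: bytes) -> bytes:
    """Placeholder 'cipher': XOR with a repeating key (encrypt == decrypt)."""
    return bytes(d ^ k for d, k in zip(data, cycle(key)))


def recover_key(plaintext: bytes, ciphertext: bytes, key_len: int) -> bytes:
    """Known-plaintext attack: XOR of plaintext and ciphertext yields the key."""
    return bytes(p ^ c for p, c in zip(plaintext[:key_len], ciphertext[:key_len]))
```

A single message whose first few bytes are predictable is enough to recover the whole key, which is why such a placeholder offers no confidentiality.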

3.6 Other limitations 

Both SESAME and SSL share a common limitation: they both use only session oriented 
protocols (they insist on a session being established before security services can be provided). In 
numerous applications this type of protocol is suitable; examples are WWW transactions, telnet, 
ftp, BSD rtools, remote procedure call and NFS. However certain applications do not require and 
cannot use session oriented protocols. For example email security relies on sessionless protocols 
(you certainly want to be able to receive a secured email without first establishing a session). 
It would certainly be advantageous to both SESAME and SSL if they also included support 
to secure sessionless protocols. The IETF has drafted the Independent Data Unit Protection 
(IDUP) Internet Draft [1]. IDUP attempts to extend the GSS-API to support security 
for a generic data unit that is not part of a security context. 

Adding security to an application always causes a decrease in performance. This decrease 
is not only due to the cryptographic computations, but also to the extra data that has to be 
exchanged, especially for session establishment. SESAME requires large tokens (2000 bytes) 
to be sent for establishing a session. This is comparable to the total length of the handshake 
messages in SSL. In practice, however, this overhead does not pose a problem. 

4 Conclusion 

SESAME and SSL have been designed for different purposes. SESAME is a comprehensive 
solution for providing single sign-on, and a range of security services for Intranet applications. 
The security services are provided at the application layer. SSL is designed for providing secure 
connections to Internet applications by placing its security at the transport layer. It provides 
numerous choices for cryptographic solutions. 





SESAME and SSL could be considered complementary technologies. However SESAME 
provides more services and is a more manageable solution. 


References 

[1] C. Adams. Independent Data Unit Protection Generic Security Service Application 
Programming Interface (IDUP-GSS-API). IETF Internet Draft, May 1998. 
ftp://mirror.aarnet.edu.au/ietf/internet-drafts/draft-ietf-cat-idup-gss-11.txt. 

[2] P. Ashley. ISRC SESAME Application Development Pages. 
http://www.fit.qut.edu.au/~ashley/sesame.html. 

[3] P. Ashley, M. Vandenwauver, and B. Broom. A Uniform Approach To Securing Unix 
Applications Using SESAME. In Information Security and Privacy - Proceedings of the 
Third Australian Conference on Information Security and Privacy ACISP’98, pages 24-35, 
Brisbane, Australia, July 1998. Lecture Notes in Computer Science - Springer Verlag. 

[4] S. Bellovin. Security Problems in the TCP/IP Protocol Suite. Computer Communications 
Review, 19(2):32-48, April 1989. 

[5] T. Dierks and C. Allen. The TLS Protocol Version 1.0, November 1997. Internet Draft. 

[6] ECMA 219. ECMA-219 Security in Open Systems - Authentication and Privilege Attribute 
Security Application with Related Key Distribution Functionality, 2nd Edition, March 
1996. European Computer Manufacturers Association. 

[7] A.O. Freier, P. Karlton, and P.C. Kocher. The SSL Protocol Version 3.0, March 1996. 
Internet Draft. 

[8] K. Hickman and T. Elgamal. The SSL Protocol, 1995. Internet Draft. 

[9] ITU. ITU-T Rec. X.509 (revised). The Directory - Authentication Framework, 1993. 
International Telecommunication Union, Geneva, Switzerland. 

[10] P. Kaijser. SESAME The European Solution to Security For Open Systems. In Proceedings 
of the 10th World Conference on Computer Security, Audit and Control COMPSEC, pages 
289-297, London, UK, October 1993. 

[11] P. Kaijser, T. Parker, and D. Pinkas. SESAME: The Solution To Security for Open 
Distributed Systems. Computer Communications, 17(7):501-518, July 1994. 

[12] J. Kohl and C. Neuman. The Kerberos Network Authentication Service V5, 1993. RFC1510. 

[13] J. Linn. Generic Security Service Application Program Interface Version 2, 1997. RFC2078. 

[14] OSI. OSI Reference Model - Part 2 : Security Architecture. ISO Information Processing 
Systems, ISO, Geneva, Switzerland, 1988. ISO 7498-2. 

[15] M. Vandenwauver. The SESAME home page. 
http://www.esat.kuleuven.ac.be/cosic/sesame. 

[16] M. Vandenwauver, R. Govaerts, and J. Vandewalle. How Role Based Access Control is 
Implemented in SESAME. In Proceedings of the 6-th Workshops on Enabling Technologies: 
Infrastructure for Collaborative Enterprises, pages 293-298. IEEE Computer Society, 1997. 





Remote Operating System Identification 

Anthony Osborne - anthony@softway.com.au 


Abstract 

The popular TCP/IP protocols are implemented from publicly available standards with each 
implementation complying to varying degrees. Within each standard of the TCP/IP protocol family 
there is a certain flexibility that is available to the implementors of the specification. These subtle 
variations between implementations provide for the possibility of remote system identification. 
Using these differences it will be demonstrated how to remotely identify operating systems in a 
stealthy, efficient and reliable manner. While the possibility of identification does not present a 
direct vulnerability, significant information leakage can occur, potentially affecting the long-term 
security of your hosts. To defend against such probes we outline a base set of requirements that 
may be utilised to protect your networks. 


Introduction 

This paper outlines what information may be obtained through remote identification, methods to 
obtain this information and the possible negative effects upon the long-term security of your hosts 
and networks. 

Currently many networks are implemented with little concern that the operating systems utilised by 
hosts are well known. In addition, numerous network services are installed with default banners 
unchanged. 

The ability to remotely identify operating systems provides the security community with an 
advantage that has not been integrated within today's network security tools. Current network 
security scanners require manual configuration to develop specific rulesets for different operating 
systems. The most obvious example is that the majority of Unix security problems are 
significantly different from security issues on the Windows NT platform. 

Remote identification is a double-edged sword that will allow for further automation of network 
vulnerability testing as well as large-scale network penetration. It must be emphasised that remote 
identification provides additional information about a host or network but it does not represent a 
direct vulnerability in itself. 

Importance of Remote Identification 

The impact of remote identification is highlighted via the information that can be obtained by 
sending and receiving one or two packets. Careful analysis may reveal the vendor of the operating 
system, the version and in specific situations the patch status of the host. 





Once the underlying operating system of the host is obtained, the host may be further probed to 
identify specific software installed. Such software can be identified only if it modifies the 
underlying protocol stack, the principal example being firewall software. Finally, in many cases it 
is possible to identify whether the host is protected by packet filtering systems such as routers. 

In 1991 Bellovin [1] outlined major security implications in both the DNS protocol and its 
implementations. These security vulnerabilities were well documented, but it was not until 1997/98 
that DNS spoofing attacks saw mass application upon the Internet. In a similar vein, the removal of 
application version information has been popular within the security community but is not common 
practice, and the obscuring of host responses at a kernel level is almost non-existent. If 
automated tools are distributed, operating system identification may have a similar 
impact upon network security. The popularity of buffer overflow attacks against Unix hosts is a 
prime example where identification provides an advantage that may be leveraged by attackers to 
penetrate a host. Furthermore, identification allows for specific automated attacks that in the past 
have required large scale scanning. 


Current Popular Methods of Identification 

It is important to review the current publicly known methods of remote operating system 
identification. Firstly, the aim is to elicit a response from a host that is unique to one specific 
vendor's implementation. To this end a number of the following strategies have been employed in 
the past. 

CERT during early 1998 [2], "received reports of widespread scans to TCP port one. The service 
assigned to TCP port one is tcpmux. By default, IRIX systems have tcpmux enabled. Once the 
intruder found a number of machines with a service running on port one/tcpmux, the intruder then 
used another automated tool to telnet to each of these machines and attempt to log in as guest, lp, 
and demos." 

This basic form of system identification has provided sufficient information to identify vulnerable 
hosts. The presence of tcpmux is not in itself a direct threat, but because IRIX enables this 
service by default it marks a host as a likely IRIX system. This example clearly demonstrates that 
associating several pieces of information will present attackers with new opportunities. 

More sophisticated tools have begun to utilise subtle differences in TCP implementations. 
Specifically, such programs look for differing responses to certain TCP flags set in the TCP 
packet header. For example, sending a combination of SYN and FIN in a TCP header results in 
differing responses: Linux responds with a SYN/FIN, some implementations respond with a SYN, 
and others respond with an ACK. The limiting factor with such tools is that they rely upon 
open ports, which are not guaranteed. 
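
To make the flag combination concrete, the sketch below packs a 20-byte TCP header with both SYN and FIN set and decodes the flag bits again, as one would when comparing replies. It is an illustration only: the checksum is left at zero, the window size is an arbitrary value, the function names are hypothetical, and actually sending such a segment requires a raw socket, which is not shown.

```python
import struct

# TCP flag bits as laid out in the TCP header.
FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20


def tcp_header(src_port, dst_port, seq, flags, window=512):
    """Pack a 20-byte TCP header without options (checksum left at zero)."""
    offset_flags = (5 << 12) | flags  # data offset of 5 words, then flag bits
    return struct.pack(">HHIIHHHH", src_port, dst_port, seq, 0,
                       offset_flags, window, 0, 0)


def flag_names(header):
    """Decode the flag bits of a packed TCP header."""
    flags = struct.unpack(">H", header[12:14])[0] & 0x3F
    names = ["FIN", "SYN", "RST", "PSH", "ACK", "URG"]
    return [name for bit, name in enumerate(names) if flags & (1 << bit)]
```

For example, `flag_names(tcp_header(40000, 80, 1, SYN | FIN))` yields `["FIN", "SYN"]`. Applying `flag_names` to the header of each reply gives the value that differs between implementations.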


Other Possibilities 

To elicit a unique response from a host there are numerous protocols we may utilise. The two main 
players at the transport layer are TCP and UDP, with IP, a network layer protocol, providing for 
their delivery. In addition, ICMP communicates error messages at the network level between IP 
systems. 





Before any detailed investigation takes place it is useful to reflect upon the current 
utilisation of these protocols. Due to the abuses of ICMP it is often filtered at network gateways, to 
the detriment of IP. As a result, sending ICMP requests will not always result in the proper ICMP 
response, and obtaining information in this way will not always be successful. 

UDP provides essential naming services to the Internet and therefore must be supported, if only on 
selected ports. Sending UDP packets to closed ports generally results in ICMP error messages. 
During research for this paper it was found that a significant portion of networks would allow 
inbound UDP traffic to closed ports. Filtering inbound ICMP packets is quite popular although 
filtering outbound packets is not. 

TCP, the most complex and popular protocol, provides the majority of essential services upon 
TCP/IP networks. This suggests that TCP will provide for a high level of accuracy when 
identifying operating systems. The popularity of utilising TCP flags for remote identification is 
only the beginning of the numerous variations within this protocol. 

With the abovementioned protocols the majority of variations between implementations are found 
by sending unexpected or uncommon input. Further to this, if the resulting output is not relied upon 
for connectivity then the likelihood of variation is maximised. With this in mind the UDP protocol 
is analysed for its usefulness in remote operating system identification. 


Case Study: UDP 

An excellent method for identifying the vendor of a particular operating system is to send 
UDP packets to a host. When sending a UDP packet to a host the following outcomes may 
occur: the UDP packet is passed to an application on the destination host with a potential response 
returned; an ICMP error message is returned by the destination host or an intermediate host; or no 
response is returned. 
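
The three outcomes can be distinguished with an ordinary connected UDP socket, as in the sketch below. It assumes a Unix-like system on which an ICMP error for a connected UDP socket is reported to the application as a socket error; the function name, the one-byte payload and the timeout are illustrative choices.

```python
import socket


def probe_udp(host, port, payload=b"\x00", timeout=2.0):
    """Send one UDP datagram and classify what comes back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        # Connecting the socket lets the kernel report ICMP errors to us.
        sock.connect((host, port))
        try:
            sock.send(payload)
            sock.recv(4096)
            return "reply"
        except socket.timeout:
            return "no-response"
        except OSError:
            # For example ConnectionRefusedError after a port unreachable.
            return "icmp-error"
```

This only reveals that an ICMP error occurred. Examining the contents of the ICMP message itself, as the tests in this section do, requires capturing the raw packet.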

If an ICMP port unreachable message is received in response to this UDP packet, an arbitrary 
amount of the original packet will be returned so that at the discretion of the implementation the 
sending application may be notified of the error. From RFC 1122 [6], "every ICMP error message 
includes the Internet header and at least the first 8 data octets of the datagram that triggered the 
error; more than 8 octets MAY be sent; this header and data MUST be unchanged from the 
received datagram". 

It is important to note that numerous tests can be applied to the ICMP response. Testing a variety of 
different operating systems has shown that each will respond in a differing manner, to the effect 
that 13 different vendors' implementations can currently be identified. It is somewhat surprising to 
note that even the *BSD implementations differ to the degree that each is distinguishable. 

One specific test that may be applied to the ICMP response, and that has been previously 
documented [4], exploits an implementation bug. This is only one of many simple but effective 
differences that may be noted. The original 4.4 BSD networking code, when returning an ICMP port 
unreachable response to a UDP packet, causes the IP length in the returned IP header to be 20 bytes 
too large. We can therefore immediately identify any 4.4 BSD operating system, or implementation 
derived from this networking code, that has not fixed the problem. 
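
The IP length test can be expressed as a short check. The sketch assumes that the ICMP message (starting at the ICMP header) has already been captured as raw bytes, for example with a raw socket or a packet capture, and that the total length of the probe datagram that was sent is known. The function names are hypothetical.

```python
import struct


def quoted_ip_length(icmp_message: bytes) -> int:
    """Return the total-length field of the IP header quoted in an
    ICMP port unreachable message."""
    icmp_type, icmp_code = icmp_message[0], icmp_message[1]
    if (icmp_type, icmp_code) != (3, 3):
        raise ValueError("not an ICMP port unreachable message")
    quoted = icmp_message[8:]  # skip the 8-byte ICMP header
    return struct.unpack(">H", quoted[2:4])[0]


def has_bsd_length_bug(icmp_message: bytes, sent_length: int) -> bool:
    """True if the quoted IP length is 20 bytes larger than what was sent."""
    return quoted_ip_length(icmp_message) == sent_length + 20
```

A probe of 28 bytes (20-byte IP header plus an empty UDP datagram) that comes back quoted with a length of 48 would therefore point to unfixed 4.4 BSD derived networking code.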





From testing we note that BSDI still returns the incorrect IP length but both FreeBSD and 
OpenBSD have fixed this bug. Conventional security methodology would suggest that the 
ICMP response does not represent a vulnerability; however, interesting information may be 
obtained. 

Another previously undocumented test is to examine the length of the quoted IP datagram that 
triggered the error. As noted above, RFC 1122 requires that at least eight octets be returned and 
permits more, and in practice the actual length varies within the flexibility of the RFC. The 
majority of implementations return eight octets, with two noticeable exceptions: Solaris 2.5.1 
and 2.6 return 64 octets, while Linux returns most of the original datagram, up to around 528 
octets. Therefore, looking at the size of the ICMP port unreachable message provides information 
that will identify unmodified Linux and Solaris hosts. 
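
The length test can be sketched as follows. It assumes that the octet counts quoted above refer to the data that follows the quoted IP header, and that the raw ICMP message (starting at the ICMP header) has been captured. The mapping from length to operating system simply restates the observations in this paper and is illustrative; the function names are hypothetical.

```python
def quoted_data_octets(icmp_message: bytes) -> int:
    """Count the octets of the original datagram quoted after its IP header."""
    quoted = icmp_message[8:]               # skip the 8-byte ICMP header
    ip_header_len = (quoted[0] & 0x0F) * 4  # IHL field is in 32-bit words
    return len(quoted) - ip_header_len


def guess_from_quote(octets: int) -> str:
    """Map the quoted length to the behaviour classes observed in testing."""
    if octets == 8:
        return "RFC minimum (most implementations)"
    if octets == 64:
        return "Solaris 2.5.1/2.6"
    if octets > 64:
        return "possibly Linux"
    return "unknown"
```

For the classes to be distinguishable, the probe must carry more than 64 octets beyond its IP header. Otherwise an implementation that quotes the whole datagram cannot be told apart from one that quotes a fixed 64 octets.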

The first issue that may be encountered is that packet filters often block UDP packets unless 
directed to specific applications. UDP is a basic test that provides a valuable insight into the overall 
security of a site. Often an ICMP error message has been returned informing us that we are 
"prohibited" from sending UDP packets. 

In certain cases the destination host will return no response. If other protocols such as TCP receive 
responses we can assume that some form of packet filter is dropping UDP requests. The 
implementation of packet filters dropping non-essential packets shows heightened security 
awareness; this knowledge will affect any further probes to the specific host or network. 


Remote Identification Security Issues 

With any security issue there are a number of questions that arise. Firstly, does this problem affect 
any software that is installed upon our hosts and, if so, are they vulnerable? It has been noted that 
there is no direct vulnerability, but the long-term security of hosts must be questioned. At this point 
we have shown that the current operating system of a host may be identified and logged. Based upon 
this information network security scanners and/or potential attackers may take snapshots of the 
configuration of your network. Once this information is entered into a database significant security 
risks will become apparent. For example, the time between the release of a remote exploit and the 
exploitation of vulnerable systems will decrease sharply. In addition, attackers do not need to 
return to your network until a new vulnerability has been identified. In relation to the current status 
of the Internet this is a particularly frightening thought. 

Secondly, if I am vulnerable is there any attack pattern that I should be monitoring for? Looking for 
packets that originate from a single source address, destined for a variety of hosts within your 
network, may prove fruitful. To complicate matters, a variety of different protocols may be used 
and disguised as legitimate connections from a variety of source hosts over an extended period of 
time. The net effect is that it is extremely difficult to identify remote probes against your 
network. 
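
The simple monitoring rule mentioned above, one source address contacting many internal hosts, can be sketched as follows. The input format (pairs of source and destination address taken from a packet log) and the threshold are assumptions made for illustration.

```python
from collections import defaultdict


def find_sweeps(packets, threshold=10):
    """Return the sources that contacted at least `threshold` distinct hosts.

    `packets` is an iterable of (source_address, destination_address) pairs.
    """
    targets = defaultdict(set)
    for src, dst in packets:
        targets[src].add(dst)
    return {src: len(dsts) for src, dsts in targets.items()
            if len(dsts) >= threshold}
```

A rule of this kind catches only the naive case. Probes spread across many source hosts and a long period of time, as described above, stay below any fixed threshold.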

At this point, the obvious solution is to "firewall" your hosts and networks from the Internet. The 
main issue with firewalling internal hosts from the Internet is that it is possible to identify the 
bastion host operating system as well as the firewall software installed. More importantly not all 
firewall technologies prevent the identification of internal hosts. Possible defences against this 
issue are addressed below. 





Defences 


Today's business environment requires connectivity to external networks such as the Internet. The 
requirements for security and external connectivity must be balanced to ensure that the 
consequences of remote identification are minimised to an acceptable level. Two obvious solutions 
are available: to firewall sensitive hosts and networks, or to obscure the TCP/IP protocol 
stack of your host. 


Obscure Responses 

The majority of fields returned in TCP/IP packet headers are set at a kernel level. Each TCP/IP 
implementation has a distinct response, thus ensuring that each vendor's operating system is 
uniquely identifiable. To obscure a kernel response, either user level programs (such as ndd or 
sysctl) must be utilised or the actual kernel must be modified. The high level of technical expertise 
required to make these modifications should ensure that operating systems remain identifiable on 
the majority of hosts. In addition, there are so many variables within UDP, TCP, ICMP and ARP 
that if all were changed it would be questionable whether your host would still inter-operate 
with others. 

The modification of the underlying operating system has subtle ramifications. When an 
implementation is modified, fixing bugs or adding new features will affect the version information 
that may be obtained from the host. Theo de Raadt, lead OpenBSD developer, discussed [7] the 
issues surrounding the modification of OpenBSD's ICMP responses to ensure that it returned the 
correct data. He noted "that it would be nice to fix the discrepancies one of these days, either way, 
you can't win: your identified. I believe our stack sends back slightly incorrect ICMP data. 
OpenBSD would once again be identifiable by what we've fixed and what we have not". It is 
interesting to note that OpenBSD did modify their IP stack to return the correct data. Now it is 
possible to uniquely distinguish between OpenBSD version 2.2 and 2.3 remotely and in turn from 
any other operating system. 


Firewalls 

Firewalls in general are portrayed as the total security solution for Internet connectivity and in the 
case of remote operating system identification this is most definitely not the case. 

The most basic firewalling technology, packet filtering, provides little protection due in part to the 
fact that it is not session based. Even if access lists are correctly configured they evaluate individual 
packets without context. An excellent example of this is when Cisco access lists utilise the 
"established" keyword. The net effect is that they drop the initial SYN packet of a connection but 
allow numerous packets to pass including the response. External parties can identify operating 
systems behind the majority of basic packet filtering technologies by creating TCP packets that do 
not include the SYN flag. 
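
The weakness of a rule that evaluates packets without context can be illustrated with a small Java model. This is a sketch only: the class and method names are hypothetical, and the rule modelled (pass any TCP segment that has the ACK or RST bit set, which is how the "established" keyword is generally documented to behave) is an assumption about the filter, not a description of any particular router.

```java
// Hypothetical model of a stateless packet filter rule; not vendor code.
final class StatelessFilter {
    // TCP flag bits as laid out in the TCP header.
    static final int FIN = 0x01, SYN = 0x02, RST = 0x04, ACK = 0x10;

    // Assumed "established" semantics: pass any segment with ACK or RST set,
    // without checking whether a matching connection was ever opened.
    static boolean passesEstablished(int tcpFlags) {
        return (tcpFlags & (ACK | RST)) != 0;
    }

    public static void main(String[] args) {
        System.out.println("SYN      passes: " + passesEstablished(SYN));
        System.out.println("bare ACK passes: " + passesEstablished(ACK));
        System.out.println("FIN      passes: " + passesEstablished(FIN));
        System.out.println("SYN+ACK  passes: " + passesEstablished(SYN | ACK));
    }
}
```

Under these assumptions the initial SYN is dropped, but an unsolicited ACK that belongs to no connection is passed to the internal host, whose reply can then be examined.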

More sophisticated "stateful" packet filtering firewalls are session based, but they suffer from other 
issues. As the name suggests, stateful packet filters do maintain session information. During testing 
it was noted that at least one implementation did not rewrite all values of the TCP/IP packet headers as 
they passed through the stateful packet filter. The direct result was that internal hosts provided 
enough information to remotely identify their operating system. 

Traditional application proxy based systems by nature recreate a new packet header and are session 
based. Since the TCP/IP headers are recreated, identifying hosts behind the firewall cannot be done 
through methods outlined within this paper. Whatever underlying technology your firewall utilises, 
it must rewrite the majority of fields in TCP/IP packet headers. In doing so, the only hosts that can 
be identified are the firewall and any external hosts. 

One final point that relates to all the above technologies is that application level banners within the 
data stream will bypass any firewall. 


Traditional Measures 

One of the first and most important steps in security is to ensure operating systems maintain patch 
levels. With regard to the ability to remotely identify an operating system, patches are vital. Once the 
vendor and version are obtained, potential attackers can await new vulnerabilities before returning. 
In such cases warnings of imminent attack will be non-existent. Once the underlying operating 
system patches have been installed, all vendor specific responses from applications should be 
reviewed. In doing so, the only information that may be obtained from your networks is from the 
bastion host. 

At a protocol level, filtering all UDP/TCP requests to non-essential services is always a good 
security measure. It may be possible in certain cases to block all UDP packets. Filtering certain 
ICMP messages is useful but remember not to be too restrictive as several ICMP messages are 
required for efficient IP connectivity. 

Developing access lists on routers for the above protocols is important, but knowing your router's 
capabilities is also vital. Certain brands of routers provide ICMP responses when packets are 
dropped. In addition, not all TCP packets are dropped when "deny" access lists have been installed. 


Conclusion 

The possibility of remote operating system identification has been shown and possible defences 
have been outlined. Without supplying specific details of operating system differences or which 
tests to apply, it has been shown that it is possible to identify which operating system is running on 
a host remotely. It has been noted that this in itself is not a vulnerability, but it is believed that such 
information does pose long term security issues. 


Bibliography: 

[1] "Using the Domain Name System for System Break-ins", Steven M. Bellovin, presented at 
the 5th UNIX Security Symposium 1995. 

[2] "CERT Summary CS-98.06", June 11, 1998. 





[3] "TCP/IP Illustrated Volume 1 - The Protocols", W. Richard Stevens. 

[4] "TCP/IP Illustrated Volume 2 - The Implementation", Gary R. Wright and 
W. Richard Stevens, 1995. 

[5] "Internetworking with TCP/IP - Volume 1, Principles, Protocols, and 
Architecture", Douglas E. Comer 

[6] "Requirements for Internet hosts - communication layers", RFC 1122 

[7] Posting to comp.security.unix by Theo de Raadt, 20 Mar 1998. 





Unix Systems Programming using Java 

Jan Newmarch, Distributed Information Laboratory 
Information Sciences and Engineering, University of Canberra 
jan@ise.canberra.edu.au, http://pandonia.canberra.edu.au 

1. Introduction 

The Unix systems programming API has been standardised by a large number of POSIX standards 
documents from the IEEE. For the majority of programmers the most important of these is POSIX 
1003.1 which covers process primitives and environments, directories and files, I/O primitives and 
device specific functions. Later standards cover other issues, such as Shells and Utilities (POSIX 
1003.2) and Administration (POSIX 1387.2). 

One of the unusual characteristics of POSIX 1003.1 is its use of the C language throughout. 
Whereas IEEE standards would normally be expected to be language independent, the history of 
Unix is so bound up with C that this was the only feasible route. The document claims that it will be 
the basis for definitions that are independent of particular programming languages. In fact, new 
language dependent versions have arisen, such as Ada (POSIX 1003.5) and Fortran (POSIX 
1003.9). 

Irrespective of the formal standards mechanisms, designers of libraries for other languages have 
used POSIX 1003.1 as the specification for their own libraries. For example, Tcl [Ousterhout] and 
Perl [Wall] have well-developed Unix bindings that copy the syntax, and rely on the semantics, of 
POSIX 1003.1. 

Java [Arnold] is the latest flavour in programming languages for all sorts of excellent reasons, 
despite some serious shortcomings. It is a full OO language (although not as pure as Smalltalk or 
Eiffel); it is based on C/C++, but avoids the worst of both languages (not completely, though); by 
use of a virtual machine it is cross-platform (although some vendors have variant versions); it has a 
rich set of libraries that deal with, for example, graphics and networking. These are combining 
to produce the “Web programming” environment where applications or applets are written once to 
run on any platform. Java extends the open authoring environment of HTML to an open 
programming environment. 

How do Unix and Java co-exist? Very well in general terms. A Java application can do exactly the 
same things in a Unix environment that it can do in a Mac or Windows environment. With care, 
applications can even be written that transparently handle such issues as forward- versus 
back-slashes for directory separators. 

Java currently lacks the depth of hooks into the POSIX API that Tcl and Perl have. The standard 
libraries must remain without these hooks, or Java will lose its universal applicability. However, extensions by 
non-standard libraries can supply this depth, as long as they are done in an appropriate manner. 

This paper reports on a project to make the POSIX API available to Java applications or applets 
using a non-standard library. It encapsulates POSIX calls within Java methods for a new set of 
classes in the posix package. 

This project must be contrasted in aims and methods to the Microsoft policy with J++: their aim 
is to make J++ the best environment for doing Windows programming in Java [J++], and this 
implicitly rejects the notion of a cross-platform environment. It may even turn out to be difficult for 
the programmer using J++ to be certain of how general their code is. The statements from 
Microsoft do not seem to dispel this uncertainty. 

This project, on the other hand, is explicit about the introduction of non-standard features, by 
placing all the new classes in a package that is non-standard. No application using the posix 
package will ever pass the “Pure Java” tests. On the other hand, this package will develop by an 
open process: the source code will be freely available to all, and can be freely modified. If 
interest is high enough, it can be divorced from individual control and given to a suitable 
co-ordinating authority. 

2. What’s in POSIX? 

Most operating system APIs share common functionality. For example, both Unix and Windows 
have means to list the files in a directory and to change to that directory. Some of this common 
functionality has already been captured in the Java class File which can return a list of files in a 
directory. However, the ability to change directories does not appear to be there because it has 
different effects in DOS to Unix (in DOS it changes directories for all processes, but in Unix only 
for the calling process). 

Some of the areas in which the POSIX API supplies extra functionality over the Java Core are given 
in the following (incomplete) table: 


Area         Extra functionality
Files        multiple links, file status, access mode, FIFOs
Processes    file creation mask, execution times, user id, signals, process groups


3. A POSIX Library 

Java allows an application or applet to be built from application-specific files and from common 
libraries. The Java Core libraries supply a standard set of libraries that can be guaranteed in all 
environments with the Java logo (although at least one vendor is likely to lose this by 
non-compliance with this requirement). The Core libraries cover language features (java.lang), 
general I/O (java.io), useful utilities (java.util), networking (java.net), windowing 
(java.awt), etc, plus techniques such as remote method invocation (RMI) and Java Native 
Interface (JNI). 

Any application or applet can add classes that are written in Java. These can be accessed from local 
copies or downloaded across the network. This is the principle behind the network computer, and is 
captured in the “Pure Java” certification. This validates that an application or applet should be able 
to run in any Java-conforming environment. 

The key to all of this is that implementations of the Java Core classes must exist on all 
Java-conforming platforms. It is not often stated, but these Core classes are riddled with native, 
platform-dependent code. They have to be: the AWT relies on the native windowing system, and 
the network classes rely on the native TCP/IP stack. What Pure Java really means is: use no native 
code apart from that sanctioned by Sun. 

The Java Virtual Machine interpreters are currently written in C. There is an interface between 
compiled Java and the interpreter which is formalised by a set of Java to native-code C functions. 
Regrettably, there are several sets of interface APIs: the original set from Java 1.0, which is still 
heavily used but not talked about anymore (e.g. in the AWT); the Java Native Interface (JNI) which 
is the new public standard [JNI]; and the Microsoft set, which does not support JNI and is one of the 
points of contention between Sun and Microsoft. 

This POSIX library is implemented using the Java Native Interface (JNI) API. It is a “thin” 
library, calling directly into the C API with very little extra work. It will need to be compiled for all 
native platforms, since it is written in C as well as Java. The intent is that it should run on all POSIX 
platforms, but of course this has still to be tested. Windows NT presents an interesting case: it has a 
POSIX compliant subsystem, but without Microsoft support for JNI access to this subsystem it will 
only be available through the runtime engines of other vendors. 

4. Language Differences 

4.1 Return Values 

The POSIX system calls in C use the return value of functions for a variety of purposes. For 
example, in open() a “handle” to the file (a file descriptor) is returned for all future references to 
the file; in umask() the previous value of the mask is returned. 

In Java, these uses may be irrelevant or deprecated. Handles to structures are rarely needed since in 
Java one is dealing with objects, and the reference to an object is a handle to that object. Special 
integer values (for example) are not needed. 

Style issues may cause some problems. The Java Beans specification encourages the use of 
“getter” and “setter” methods. For example, the function umask() in C returns the old mask while 
also setting a new one. In Java Beans style, this would be two separate methods, getUmask() and 
setUmask(). 
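
One consequence of this split can be sketched in code. The C call can only read the mask by writing it, so a getter built on umask() has to install a temporary value and then restore the original. The following is a minimal sketch under that assumption; the class UmaskSketch and its simulated mask field are hypothetical and stand in for the native call, and only the method names getUmask() and setUmask() come from the text.

```java
// Sketch only: simulates the process umask in a static field instead of
// calling native code.
final class UmaskSketch {
    private static int processMask = 022;   // stand-in for the per-process mask

    // Behaves like C umask(): installs a new mask and returns the previous one.
    static int umask(int newMask) {
        int old = processMask;
        processMask = newMask & 0777;
        return old;
    }

    // Java Beans style setter: the old value is simply discarded.
    static void setUmask(int newMask) {
        umask(newMask);
    }

    // Java Beans style getter: umask() cannot read without writing, so set a
    // temporary value and immediately restore the original.
    static int getUmask() {
        int current = umask(0);
        umask(current);
        return current;
    }
}
```

Between the two calls in getUmask() the mask is briefly zero, so a real binding used from several threads would probably need to guard this method.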

4.2 Data Types 

There are some data representation issues that are difficult to resolve. Java has a more limited set of 
integer types, and does not include the unsigned types. Mostly, this does not seem to be a serious 
problem: in writing a Java program there is no possibility of overflowing a signed type into an 
unsigned one. However, if a Java runtime engine became accessible to a C program an overflow 
from unsigned type to signed could occur. 
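
The overflow in question can be shown with a short sketch. It assumes a 32-bit unsigned C value arriving in a Java int; the class and method names are hypothetical and are not part of the binding.

```java
// Sketch of what happens when a C unsigned 32-bit value lands in a Java int.
final class UnsignedSketch {
    // Widen a 32-bit pattern that C regards as unsigned into a non-negative long.
    static long fromUnsignedInt(int bits) {
        return bits & 0xFFFFFFFFL;
    }

    public static void main(String[] args) {
        // The bit pattern of the unsigned value 3,000,000,000 stored in an int.
        int raw = (int) 3000000000L;
        System.out.println("as a Java int:  " + raw);                  // negative
        System.out.println("widened to long: " + fromUnsignedInt(raw));
    }
}
```

Carrying such values in the next wider signed type is one possible workaround, at the cost of the Java signature no longer matching the C one.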

Java does not allow mechanisms to perform address arithmetic. There does not seem to be any need 
to do this in the POSIX API, so again this is not a problem. 





4.3 Error handling 

System calls may succeed or fail. There must be a way of signalling both cases. In C, the standard 
method to do this is by the function return value. For example, the function open() will return a 
non-negative integer (the file descriptor) if the call succeeds, or minus one if it fails. 

In Java, errors are signalled by raising exceptions, and if no exception is raised then the method has 
succeeded. 
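
The two conventions can be put side by side in a short sketch. Everything here is hypothetical: the pretend file table, the descriptor value 3 and the exception class SketchPosixException merely stand in for the real system call and for the binding's PosixException.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical exception type standing in for the binding's PosixException.
class SketchPosixException extends Exception {
    SketchPosixException(String message) { super(message); }
}

final class OpenStyles {
    // Pretend file system: only these names can be opened.
    private static final Set<String> EXISTING =
        new HashSet<>(Arrays.asList("/etc/passwd", "/etc/hosts"));

    // C style: a non-negative descriptor on success, -1 on failure.
    static int openCStyle(String path) {
        return EXISTING.contains(path) ? 3 : -1;
    }

    // Java style: returning at all means success; failure raises an exception.
    static int openJavaStyle(String path) throws SketchPosixException {
        int fd = openCStyle(path);
        if (fd == -1) {
            throw new SketchPosixException("No such file or directory: " + path);
        }
        return fd;
    }
}
```

With the second form the caller cannot forget to test the result, since the failure path is a separate catch clause.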

4.4 Decisions 

For a first cut at a Java/POSIX binding all decisions were made in favour of the C programmer. 

This was based on the fact that the design decisions for the API had already been made and it was 
relatively easy to take these decisions unchanged into Java. It also allowed concentration on the 
(initially) “harder” bits which involved using the JNI API to map Java into C. 

However, the resulting code did not fit the Java coding styles or standards. Programs using this had 
an awkward feel. It was discarded in favour of a more Java-like solution, reworking the POSIX 
API. The following sections discuss some of the possibilities and why this decision was made. 

5. Class Structure 

Related to the language issue is the classification of POSIX calls into Java classes. C++ can adopt 
the C API unchanged since it is a mixed OO/procedural language that will accept C functions 
unchanged. In Java, all functions must be methods of classes. This means that some changes 
must be made to the API anyway. This section discusses some of the issues in this. 

The POSIX specification chapter on “Files and Directories” contains function calls in a number of 
categories. There is a set of functions that merely take the name of a file: these include chmod() and 
access(). Another set refer to a file as an object by using a file descriptor, such as open(), creat(), 
read() and write(). The C system calls stat() and fstat() take a structure and fill values in. 
The functions link() and unlink() really refer to file names within directories, and as such refer 
to directories rather than files, even though programmers tend to think of them as file rather than 
directory operations. The function umask() sets file creation flags for the current process, and is 
unrelated to files except in their creation. 

The other chapters in the POSIX 1003.1 specification contain a similar mixture of categories. In 
other words, this specification is not drawn up on lines that are suitable for OO languages. 
Categorisation of C functions into OO classes and methods is not clear-cut. 

5.1 Class Stat 

A set of functions that map well into Java are the functions stat() and fstat() as they deal with 
the stat structure. The C stat structure has a number of fields and a set of C macros that act on 
one of these fields: 

struct stat { 
    mode_t st_mode; 
    ino_t st_ino; 
    /* etc */ 
}; 

/* constants */ 
#define S_IRUSR ... 
#define S_IWUSR ... 

/* macros */ 
#define S_ISDIR(mode) ... 
#define S_ISCHR(mode) ... 

This displays a number of characteristics of POSIX: 

• Structure field names have a prefix st_ to label them as belonging to structure stat 

• Native data types are hidden by typedef definitions, such as mode_t 

• To reduce name clashes, defined constants have a prefix S_ 

• To reduce name clashes, macros have a prefix S_ 

The functions that manipulate this structure are stat() and fstat(). These fill in the fields with 
values that can be queried by a program. These system calls cannot be used to set values. 

In Java terms, many of these devices for avoiding namespace problems can be dropped, since Java 
has its own mechanisms. The read-only nature of the fields can also be specified. The difference 
between the two system calls that fill in the stat structure corresponds to two different constructors. 
However, the type synonym problem is not handled well, and the binding may need to revert to base types. A 
Java class for this structure and its functions could be 

public class Stat { 

    // constants 
    public static final int RWXU; 
    public static final int RUSR; 
    public static final int WUSR; 
    // etc 

    // constructors 
    public Stat(String fileName) throws PosixException;    // stat() 
    public Stat(posix.File file) throws PosixException;    // stat() 
    public Stat(posix.OpenFile fd) throws PosixException;  // fstat() 

    // public field access 
    public int getMode(); 
    public int getIno(); 
    public int getDev(); 
    // etc 

    public boolean isDir(); 
    public boolean isChr(); 
    // etc 
} 

Note also that the constants such as RUSR are not assigned their “well known” values (such as 
0400) in these definitions. The actual values are not part of POSIX and so should not be part of the 
Java class definitions. The technique of “blank finals” is used here, and is discussed later in 
implementation issues. 
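
How the C macros turn into predicate methods can be sketched as follows. This is an illustration only: the class StatSketch is hypothetical, it takes the mode as a constructor argument instead of calling stat(), and the numeric values are the conventional Unix ones, hard-coded here purely so the sketch runs (the binding itself would obtain them from the platform, as noted above).

```java
// Hypothetical, self-contained stand-in for the Stat class described above.
final class StatSketch {
    // Conventional Unix values, assumed for illustration only.
    private static final int IFMT  = 0170000;  // mask for the file-type bits
    private static final int IFDIR = 0040000;  // directory
    private static final int IFCHR = 0020000;  // character device
    static final int ROTH = 0004;              // readable by others

    private final int mode;

    StatSketch(int mode) { this.mode = mode; }

    int getMode() { return mode; }

    // Counterparts of the S_ISDIR() and S_ISCHR() macros.
    boolean isDir() { return (mode & IFMT) == IFDIR; }
    boolean isChr() { return (mode & IFMT) == IFCHR; }

    // Permission bits are tested with a mask and an explicit comparison,
    // since Java has no implicit int-to-boolean conversion.
    boolean isWorldReadable() { return (mode & ROTH) != 0; }
}
```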

5.2 Class File 

File operations fall into two sets: those that require only filenames (such as unlink()), and those 
that require open files (such as read()). 





The operations that only use filenames can form a class of their own. The Java Core library has a 
class java.io.File that plays a similar role in the platform-independent world. The Java class 
shows a range of decisions that can be made in this mapping. 

1. C functions could map onto static class methods of Java. This would make them most 
like the C calls. The methods could return the same values as the C calls, handled in the same 
way as in a C program. This would produce Java code such as 

if (File.unlink("/etc/passwd") == -1) { 
    // handle the error 
} 

2. C functions could map onto static class methods, but error handling would be done through 
exceptions rather than return values 

try { 
    File.unlink("/etc/passwd"); 
} catch (Exception e) { 
    // handle the error 
} 

3. Since each of the methods deals with a single file by name, a more OO approach would create 
an object with this name and call methods on it. Since we probably will not need such objects 
to have long-term existence there will often be no need to save them in variables. Using this, 
and returning C style error values gives 

if (new File("/etc/passwd").unlink() == -1) { 
    // handle the error 
} 

4. Finally, using objects plus exceptions gives 

try { 
    new File("/etc/passwd").unlink(); 
} catch (Exception e) { 
    // handle the error 
} 


The first version will be most familiar to C programmers. This is the way that entrenched C 
programmers can still program in C while calling it Java. The last is preferred Java style. A Java 
binding can certainly be produced using the first method, and this was done in an early version. 
However, this was discarded in favour of the last method on the grounds that a binding that does 
not “feel right” would ultimately fail. 
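
The third and fourth styles can be tried out without any native code. The sketch below is not the binding's File class: NamedFile is a hypothetical stand-in that delegates to the standard java.nio.file.Files.delete() call so that it runs on any platform.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical stand-in for the binding's File class, built on the standard
// library instead of JNI.
final class NamedFile {
    private final Path path;

    NamedFile(String name) {
        this.path = Paths.get(name);
    }

    // Style 4: an object method that reports failure by exception.
    void unlink() throws IOException {
        Files.delete(path);   // throws an IOException if the name cannot be removed
    }

    // Style 3: the same operation with a C-like return value.
    int unlinkCStyle() {
        try {
            Files.delete(path);
            return 0;
        } catch (IOException e) {
            return -1;
        }
    }
}
```

A caller would write new NamedFile(name).unlink() inside a try block, exactly as in the fourth form above.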

5.3 Class OpenFile 

The read() and write() functions act on open files. A file opened only for reading cannot be 
written to, and vice versa. Safety would suggest dividing open files into two classes, readable files and 
writable files. This is done in Core Java with the classes FileInputStream and FileOutputStream. 
However, in Posix a file can be opened as both readable and writable. This is a strong candidate for 
a set of classes using multiple inheritance 


          OpenFile 
         /        \ 
  ReadFile        WriteFile 
         \        / 
       ReadWriteFile 


Regrettably, Java does not have multiple inheritance so this does not map very nicely. 

Despite duplication of code, the class ReadWriteFile must also be a direct child of OpenFile. This 
common parent class is needed for a number of reasons: 

• common constants such as O_NONBLOCK should be defined in a parent class 

• common methods such as close() should be defined in a parent class 

• some Posix calls use a file descriptor that does not distinguish between files opened for read, 
for write or for both, such as fstat(). Such calls (when turned into methods) will need a 
generic object of any of these types. 

The class structure is thus 


                 OpenFile 
               /     |     \ 
       ReadFile ReadWriteFile WriteFile 


with details 

public class OpenFile { 
    public static final int NONBLOCK; 
    // etc 

    public void close(); 
    // etc 
} 

public class ReadFile extends OpenFile { 
    public ReadFile(String path); 
    public int read(byte buf[], int nbyte); 
} 

public class WriteFile extends OpenFile { 
    public WriteFile(String path); 
    public int write(byte buf[], int nbyte); 
} 

public class ReadWriteFile extends OpenFile { 
    public ReadWriteFile(String path); 
    public int read(byte buf[], int nbyte); 
    public int write(byte buf[], int nbyte); 
} 

5.4 Class Pipe 

Posix uses the un-named pipe as the primary means of inter-process communication, although there are 
plenty of other methods: named pipes, streams, shared memory, ports, etc. A pipe creates a pair of 
I/O channels, one used for read and one for write. The C pipe() call fills in an array of two file descriptors. 
The programmer has to remember that index zero is used for reads, in analogy to file descriptor 
zero being standard input. Similarly, index one is used for writes. 

A Java version of pipe() can define a class that avoids the potential problem of poor programmer 
memory by making type-safe versions in a class Pipe: 

public class Pipe { 
    public Pipe(); 

    public ReadFile in(); 
    public WriteFile out(); 
} 

The two public methods are both “getter” methods that return the appropriate OpenFile. However, 
by returning a subclass, it is only possible to use “read” methods on in() and “write” methods on 
out(). 
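
The same type-safety idea can be demonstrated with Core Java streams. This is a sketch, not the binding's Pipe: StreamPipe is a hypothetical class built on java.io.PipedInputStream and java.io.PipedOutputStream, which connect two ends inside one Java process and do not create an operating system pipe.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.PipedInputStream;
import java.io.PipedOutputStream;

// Hypothetical in-process analogue of the Pipe class described above.
final class StreamPipe {
    private final PipedInputStream readEnd;
    private final PipedOutputStream writeEnd;

    StreamPipe() throws IOException {
        readEnd = new PipedInputStream();
        writeEnd = new PipedOutputStream(readEnd);   // connect the two ends
    }

    // The return types make it impossible to write to the read end
    // or read from the write end.
    InputStream in()   { return readEnd; }
    OutputStream out() { return writeEnd; }

    public static void main(String[] args) throws IOException {
        StreamPipe p = new StreamPipe();
        p.out().write("hello\n".getBytes("US-ASCII"));   // small write, fits in the pipe buffer
        p.out().close();
        int c;
        while ((c = p.in().read()) != -1) {
            System.out.write(c);
        }
        System.out.flush();
    }
}
```

There is no index to remember: the compiler rejects p.in().write(...) because InputStream has no such method.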

5.5 System call dup2 () 

A typical systems programming activity is to create a pipeline of two separate processes, such as 
ls | wc 

In order to do this in C, it is necessary to create a pipe, perform a fork(), and then map the ends of 
the pipe onto the standard file descriptors using dup() or preferably dup2(). 

The dup2() C function call takes two parameters, the current file descriptor and the file descriptor 
it would “like to be”. Typically one of the desired file descriptors is zero (for standard input) or 
one (for standard output). The “current file descriptor” will be an OpenFile object. The desired 
file descriptor could be an integer as in the C call, or could be another OpenFile object, since this 
binding is deprecating the need for an integer file descriptor. After the dup2(), reads/writes to 
either object should have the same effect. 

For this to function seamlessly, there will need to be existing objects corresponding to file 
descriptors 0, 1 and 2. These can be done in a similar manner to the Core streams System.in and 
System.out. 


These all give further fields and methods of class OpenFile: 

public class OpenFile { 
    protected native void dup2(int fd); 
    public void dup2(OpenFile f); 
} 

with the standard files in a general Posix class 

public class Posix { 
    static public ReadFile stdin; 
    static public WriteFile stdout; 
    static public WriteFile stderr; 
} 

This means that we can now perform operations such as 

ReadFile f = new ReadFile("/etc/passwd"); 
f.dup2(Posix.stdin); 
f.read(buf, 128);            // reads from /etc/passwd 
Posix.stdin.read(buf, 128);  // reads more from /etc/passwd 


5.6 Class Process 

Java already has threads, and in practice they sit on top of a threads package which may be supplied 
from the O/S or be a separate package. These allow multiple Java threads to share memory and 
objects, but otherwise run disjointly. 

The traditional multi-tasking model in Unix has been via processes, which are more heavyweight 
and do not share memory (although they share some things, such as file descriptors). Processes are 
identified by a “process id” which is an unsigned integer. Each process has its own process id, and 
all but the initial process will have a parent process id. 

There are a host of functions that work on the current process, such as getpid() and getuid(). 
Functions that work on other processes, such as kill(), can only use the process id since that is all 
the information that will typically be available. 

In any running Java application there can only be one process (although there may be many 
threads). It hardly seems worthwhile creating an object of type Process if there can only be one of 
them. So instead of being able to create objects of this class, it is easier to define all of its 
operations as static methods of the class. 

This binding is not complete for this class. A partial definition is 

public class Process { 
    public static int execvp(String file, String args[]); 
    public static int fork(); 
} 

5.7 Class PosixException 

A PosixException is raised when an error occurs. This contains an error message as produced by 
strerror(). 

6. Examples 





6.1 Copying input to output 

Copying from standard input to standard output may be done by 


final int SIZE = 1024; 
byte buf[] = new byte[SIZE]; 
int nread; 
try { 
    while ((nread = Posix.stdin.read(buf, SIZE)) != 0) 
        Posix.stdout.write(buf, nread); 
} catch (PosixException e) { 
    System.err.println("Error in copy " + e.toString()); 
} 

The while loop terminates on end of file. Any error is caught by the exception handler. This 
separates the two possibilities which are often combined in the C code of 
while ((nread = read(...)) > 0) 
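
For comparison, the same loop can be written against the Core java.io streams, which runs without the posix package. This is a sketch for illustration: CopySketch is a hypothetical class, and note that InputStream.read() signals end of file with -1 where the binding's example above uses 0.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper showing the copy loop with Core Java streams.
final class CopySketch {
    // Copies everything from in to out and returns the number of bytes copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[1024];
        long total = 0;
        int nread;
        // End of file is the value -1; errors arrive separately as IOException.
        while ((nread = in.read(buf, 0, buf.length)) != -1) {
            out.write(buf, 0, nread);
            total += nread;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        copy(System.in, System.out);
        System.out.flush();
    }
}
```

As in the binding, end of file and failure are kept apart: the loop condition only ever sees the first.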

6.2 File information 

To check read permissions on the current directory: 

Stat currDir = null; 
try { 
    currDir = new Stat("."); 
} catch (PosixException e) { 
    System.out.println("Stat failed on . " + e.toString()); 
    System.exit(1); 
} 

int mode = currDir.getMode(); 
if ((mode & Stat.ROTH) != 0) 
    System.out.println(". is world readable"); 

6.3 A Pipeline 

It is common in Posix systems programming to set up a pipeline, to fork and to execute another 
program within one or both of the processes. To perform the pipeline 

ls | wc 

in C can be done by (with no error checking) 

int pfd[2]; 

pipe(pfd); 
if (fork() == 0) { 
    /* child: the read end of the pipe becomes standard input */ 
    close(pfd[1]); 
    dup2(pfd[0], 0); 
    close(pfd[0]); 

    char *args[] = {"wc", NULL}; 
    execvp("wc", args); 
} else { 
    /* parent: the write end of the pipe becomes standard output */ 
    close(pfd[0]); 
    dup2(pfd[1], 1); 
    close(pfd[1]); 

    char *args[] = {"ls", NULL}; 
    execvp("ls", args); 
} 

The classes given in this binding allow this to be done in Java by 

Pipe p = new Pipe(); 
if (Process.fork() == 0) { 
    // child: the read end of the pipe becomes standard input 
    p.out().close(); 
    p.in().dup2(Posix.stdin); 
    p.in().close(); 

    String args[] = new String[] {"wc"}; 
    Process.execvp("wc", args); 
} else { 
    // parent: the write end of the pipe becomes standard output 
    p.in().close(); 
    p.out().dup2(Posix.stdout); 
    p.out().close(); 

    String args[] = new String[] {"ls"}; 
    Process.execvp("ls", args); 
} 

7. Implementation Issues 

7.1 Constants 

POSIX defines all constants symbolically. For example, although “everybody knows” that the file 
mode “readable by user” has the value 0400, POSIX does not define it as such, but only by the 
symbolic value S_IRUSR. This is entirely correct: the actual value may depend on the platform and 
should not be pinned down in this standard. 

This means that constants should not be hard-coded in the Java binding, but instead picked up from 
the local environment. In Java 1.0 this would only be possible for variables, and allowing constants 
to be variable would not be a good idea. From Java 1.1, constants (declared as final) need not 
have their value statically defined, but can be assigned to once, in an initialisation section such as a 
constructor. This allows a native code call to assign a value. 

Alas, things are not quite straightforward. The Java compiler must ensure that the constants are 
assigned a value, and this cannot be hidden in native code inaccessible to the compiler. A dummy 
value must be used: 


public static final int RUSR; 
private static int rusr;    // dummy variable 

protected static native void initialiseConstants(); 

static { 
    initialiseConstants();  // sets rusr, etc 
    RUSR = rusr;            // yuk 
} 
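
The pattern can be exercised without JNI by replacing the native method with an ordinary one. In this sketch the class ModeConstants is hypothetical and the value 0400 is hard-coded only as a placeholder for whatever the C headers would supply on a given platform.

```java
// Runnable sketch of the blank final pattern, with the native call simulated.
final class ModeConstants {
    public static final int RUSR;   // blank final: no value at the declaration
    private static int rusr;        // dummy variable written by the "native" side

    // Stand-in for the native initialiseConstants(). A real binding would
    // fetch S_IRUSR from the platform; 0400 is an assumption for this sketch.
    private static void initialiseConstants() {
        rusr = 0400;
    }

    static {
        initialiseConstants();
        RUSR = rusr;   // the single assignment the compiler can verify
    }

    public static void main(String[] args) {
        System.out.println("RUSR = 0" + Integer.toOctalString(RUSR));
    }
}
```

Removing the assignment in the static block makes the class fail to compile, which is the compiler check described above.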

7.2 Types 

In a similar vein, POSIX uses typedefs to hide native data types. For example, mode_t is usually 
a synonym for unsigned short. This could be a different type (although it is unlikely). Java does 
not have a mechanism for typedefs. It would be possible to wrap these primitive types up in classes 
such as Integer, but this is a heavyweight solution. What is really needed for POSIX is a means in 
Java of picking up a primitive type definition from native code. The determination of primitive 
types from the native environment is extremely unlikely to occur. One of the nice things about Java 
is that it pins down the primitive types even to bit lengths (ints are 32-bit signed), and does not 
allow the variations from 64 bits on modern processors to 16 bits on 8086 processors. Allowing this 
variation, or even allowing a typedef mechanism that can be used as a synonym, just will not 
happen. 

The attempted divorce of primitive types from POSIX types will not be met in Java without a great 
deal of effort. 
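
The class-wrapping alternative mentioned above can be sketched as follows. This is only an illustration of why it is heavyweight: the class name ModeT and its methods are hypothetical and are not part of the binding described in this paper. Every mode_t value becomes an object rather than a primitive.

```java
// Hypothetical wrapper standing in for the POSIX typedef mode_t.
// The name ModeT and its methods are illustrative assumptions only.
final class ModeT {
    // Java has no unsigned short, so the bits are held in an int.
    private final int bits;

    ModeT(int bits) {
        this.bits = bits;
    }

    // Returns the raw value, e.g. for passing to a native method.
    int intValue() {
        return bits;
    }

    // True if every bit in mask is set in this mode.
    boolean has(int mask) {
        return (bits & mask) == mask;
    }

    // Returns a new ModeT with the bits in mask added; instances are immutable.
    ModeT with(int mask) {
        return new ModeT(bits | mask);
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof ModeT && ((ModeT) other).bits == bits;
    }

    @Override
    public int hashCode() {
        return bits;
    }
}
```

Compared with a C typedef, which costs nothing at run time, each value here needs an allocation and method calls for even the simplest bit test.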

7.3 Varargs 

C allows functions to have a variable number of arguments, such as printf(). This is not allowed 
in Java. This affects a few POSIX functions such as execl(), which cannot be implemented in Java. 
Fortunately, there are alternatives such as execvp() which replace the varargs with an array of 
args. 
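
A minimal sketch of the array-based form, under stated assumptions: the class name PosixExec, the native declaration, and the helper toArgv are hypothetical and are not taken from the library described in this paper. The helper shows how the separate arguments a C programmer would pass to execl() collapse into the single array that an execvp()-style method accepts.

```java
// Sketch only: PosixExec and toArgv are hypothetical names, not the paper's API.
final class PosixExec {
    // A native declaration in the style of execvp(): the arguments arrive as
    // one array, so no variable-length parameter list is needed.
    // No native library is supplied here, so this method must not be called.
    static native int execvp(String file, String[] argv);

    // Builds the argv array that a C caller of execl() would have written
    // out as separate arguments: the program name followed by its arguments.
    static String[] toArgv(String file, String[] args) {
        String[] argv = new String[args.length + 1];
        argv[0] = file; // by convention argv[0] names the program
        System.arraycopy(args, 0, argv, 1, args.length);
        return argv;
    }
}
```

For example, the C call execl("ls", "ls", "-l", "/tmp", NULL) would correspond to passing toArgv("ls", new String[] {"-l", "/tmp"}) as the array argument.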

8. Conclusion 

This paper has shown that it is possible to map a non-trivial subset of the POSIX 1003.1 API into 
Java classes and method calls. The mapping is currently incomplete but there appear to be no real 
problems in completing it. 

Mapping POSIX constants into Java constants is messy but straightforward. However, mapping 
POSIX data types into Java causes problems that will not be resolved easily. 

For this project to move forward will require a consensus effort, since there are many issues 
involved. It is currently being taken up by a project student, to flesh out the binding and to resolve 
issues. The library is available under an “open source” license from 
http://pandonia.canberra.edu.au/java/posix/. 



Jan Newmarch (http://pandonia.canberra.edu.au) 

jan@ise.canberra.edu.au 

Last modified: Tue Jul 28 10:30:07 EST 1998 




Copyright ©Jan Newmarch 





HEWLETT PACKARD 


HP-UX 11.00 and futures 


Executive summary 1 

Hewlett-Packard's HP-UX enterprise operating environment leads 
the market by many key measures: 

• breadth of functionality in meeting the IT demands of the 
enterprise, 

• market share, 

• breadth of ISV 2 application portfolio, 

• standards compliance, and 

• industry consultants' acclaim. 

HP-UX 11.00, which began shipping in November, 1997, is HP's 
complete 64-bit operating environment that delivers tremendous 
scalability and performance for demanding applications. When 
teamed with HP's leading server systems, HP-UX 11.00 provides the 
power of supercomputing at a fraction of the cost. HP-UX 11.00 
on HP's V-class server delivers the industry's highest 
performance results. 

This paper provides an overview of HP-UX, and describes the key 
features and functionality of HP-UX 10.20 and 11.00, HP's two 
strategic HP-UX platforms. It then outlines HP's future plans 
for HP-UX by providing a roadmap that describes mapping of 
hardware support and selected OS features over the next year and 
a half, culminating in the introduction of HP's IA-64-based 
systems. 3 

Setting the stage: HP’s presence in the UNIX market 

According to industry analyst Aberdeen Group, HP's share of the 
commercial RISC/UNIX server market was 52% in 1996, and its 34% 
revenue growth was 6 percentage points greater than that of the 
market overall. 4 


1 The information contained in this paper presents both current aspects of, 
and future plans for, HP-UX. Such future plans are subject to change 
without notice. 

2 ISV stands for independent software vendor, or an application developed and 
sold by one. Examples are Baan and PeopleSoft. 

3 IA-64 stands for Intel Architecture, 64-bit, a processor architecture 
jointly developed by Hewlett Packard and Intel. 

4 Aberdeen Group, Market Viewpoint, Volume 10/Number 1, February 3, 1997 





[Figure: Hewlett-Packard Dominates the Commercial RISC/UNIX Market. Chart of the 1996 worldwide commercial multi-user RISC/UNIX market. Accompanying quotation: "Aberdeen's research shows that HP continues to deliver products and services that best meet IT executives' requirements for building their next-generation information infrastructures."] 

Hewlett-Packard's HP-UX-related revenues (covering server and 
peripherals hardware, networking products, HP-UX, HP applications 
software, and services) amounted to more than $10 billion in 
1997, about a quarter of overall HP corporate revenues. 

HP's dominant position in the commercial RISC/UNIX space is the 
result of the firm's R&D investments that lead to its providing 
customers with industry-leading products and solutions throughout 
the value chain, ranging from PA-RISC processors, to systems and 
networking, to HP-UX and middleware, to support services. 

HP's large share in this market reflects its longstanding 
investment in HP-UX, to which it is very committed. HP spends 
several tens of millions of dollars annually 5 on the development of 
HP-UX and related application software. Several thousand software 
development engineers are devoted to this effort. In addition to 
its own commitment to HP-UX, HP has strategic relationships in 
place with several industry partners for technology sharing and 
exchange with regard to HP-UX. These partnerships, with Hitachi, 
NEC, and Stratus, have the aim of further extending HP-UX's 
strengths in the high end of the UNIX market. 

HP-UX 11.00 constitutes a major advance for HP-UX, and indeed for all 
of UNIX. Industry analysts, ISVs, and end customers appear to 
agree. Gartner Group, a leading industry analyst firm, 
reviewed the major enterprise operating systems and ranked HP-UX 
the highest. The chart below summarizes Gartner's analysis. 6 


5 The exact annual expenditure is company confidential. 

6 The criteria Gartner Group used in assessing the operating systems it 
reviewed were focus and position; investment; resources; competency, ISV 
enthusiasm, integration, high availability, upgrade; 
administration/management; and commodity. To access Gartner's full reports, 
see www.gartner.com/public/axl/reprints/k210326.html and 
www.gartner.com/public/axl/reprints/k210327.html 




[Figure: Gartner Group ranks HP-UX the best Operating Environment. Source: Gartner Group, Research Note, December 1997.] 


In addition to industry consultants' favorable reviews, HP-UX 
11.00 is attracting a growing list of ISV applications. Since 
its launch, the major database firms and other leading ISVs have 
developed versions of their software for HP-UX 11.00, maximizing 
its functionality. 


Overview of HP-UX 

HP's PRISM framework describes how HP-UX meets the mission- 
critical needs of the large enterprise. PRISM stands for 
Performance and scalability; Resilience, or high availability; 
Integration; Security; and Manageability. PRISM applies to both 
the core operating system and complementary HP applications 
software, or middleware, that together comprise the HP-UX 
operating environment. 


PRISM components of the HP-UX Operating Environment 

Middleware 

  Performance and scalability: 
    • Optimising compilers 
  Resilience and high availability: 
    • Multi-system High Availability: MC/ServiceGuard, MC/LockManager, 
      Enterprise Clusters 
    • Online JFS 
  Integration: 
    • Colliance Program for integration of HP-UX, NT 
    • Domain Internet Platform 
    • Enterprise Networking 
    • Application Development, incl. Java 
    • Distributed Intranet Services, incl. DCE 
  Security: 
    • Praesidium family of security functions: Single Sign On, 
      Authorization, Authentication 
    • B1 security 
  Manageability: 
    • Process Resource Mgr 
    • GlancePlus 
    • OpenView family of Systems Mgmt 

HP-UX 

  Performance and scalability: 
    • SMP 
    • 64 bits 
    • Kernel Threads 
    • Paging 
    • Fibre Channel 
    • Networking 
  Resilience and high availability: 
    • Quality 
    • Robustness 
    • Single-system High Availability: Dynamic Memory Resilience 
    • Fast Recovery 
    • Journaled File System 
  Integration: 
    • Web-enabled OS 
    • Interoperability with NT, Netware, DCE, other UNIXs 
    • Native Java 
    • Web 
  Security: 
    • C2 compliance 
    • Cryptography APIs 
    • Encryption, secure network connections through IPSec 
  Manageability: 
    • System Administration Mgr 
    • SD/UX 
    • Ignite/UX 
    • Distributed Print Service 
    • Instant Ignition 
    • DCE Client 




Extensions to PRISM, referred to as PRISM plus, include investment 
protection and breadth of the applications portfolio. HP-UX 11.00 provides 
industry-leading features and functionality in every aspect of 
PRISM. The next several sections of this paper go through the 
PRISM components of HP-UX. 


Performance and scalability 

HP-UX is suitable for a broad range of business information 
technology (IT) environments, from large NFS servers to very 
high-end computing environments. It is also ideal for a broad 
range of hardware platforms, from the single-user desktop 
workstation to mainframe-class Enterprise Parallel Systems (EPS) 
used in very high-end computing environments. The list below 
shows some of the IT environments for which HP-UX is well suited. 

• Processing-hungry applications that need the industry-leading 
performance of HP's high-end V-2250 enterprise servers 

• Applications that can benefit from the performance and 
scalability of a 64-bit environment, such as very large 
databases for on-line transaction processing (OLTP) or 
decision support systems (DSS) 

• Applications using kernel threads or porting of existing 
threaded applications to the HP-UX platform 

• File serving and batch processing, especially with a blend of 
applications 

• Environments where excellent NFS (network file system) 
performance is required 

• Mission-critical applications that require the highest 
possible uptime 

• Internet access or enterprise-wide connectivity (intranets) 

• Security for electronic commerce 

• Mixed-platform environments, especially those with Windows NT 
and NetWare 

HP-UX 10.20, introduced in August 1996 and available in a 32-bit 
version only, was the first version of HP-UX to support the 
then-new 64-bit PA-8000 processor. HP-UX 11.00 comes in 32- and 
64-bit versions. It is the first 64-bit version of HP-UX and the 
first to support the high-end V-class server. 

The table below summarizes HP-UX's support of HP's 9000 server 
line. D means it depends on the specifics of a given EPS 
configuration. 



             10.20    11.00/32-bit    11.00/64-bit 
                                      (PA-8x00-based 
                                      systems only) 
D-class      Yes      Yes             - 
K-class      Yes      Yes             Yes 
T-class      Yes      Yes             Yes 
V-class      -        -               Yes 
EPS          D        D               D 




With the 64-bit version of HP-UX running on a 16-way V2250, HP 
achieved the industry's best single-system performance and 
price/performance results, with 52,117 tpmC and $82/tpmC, 
respectively. In addition to the very fast PA-8200 processor, 
this result is largely due to the 64-bit design of HP-UX 11.00. The 
64-bit version does away with the 4 GB memory addressing 
constraint associated with 32-bit operating systems. As a 
result, HP-UX 11.00 can support more and larger applications and 
data sets, avoiding performance-reducing disk swapping. 
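
The 4 GB constraint follows directly from the width of a 32-bit address, which can name 2^32 distinct bytes. A minimal Java sketch of that arithmetic is given here; the class and method names are illustrative assumptions, and 4 GB is taken to mean 4 x 2^30 bytes.

```java
// Illustrative only: AddressSpace is a hypothetical helper, not an HP-UX API.
final class AddressSpace {
    // Number of distinct byte addresses expressible in the given number of bits.
    // Valid for 0 to 62 bits; larger values would overflow a signed long.
    static long addressableBytes(int addressBits) {
        if (addressBits < 0 || addressBits > 62) {
            throw new IllegalArgumentException("addressBits must be in 0..62");
        }
        return 1L << addressBits;
    }
}
```

With 32 address bits the method returns 4,294,967,296 bytes, which is the 4 GB ceiling referred to above.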

While its 64-bitness is a key calling card for HP-UX 11.00, this 
version incorporates some innovative features independent of 64 
bits that contribute to faster performance and greater 
scalability, including 

■ Performance Optimized Page Sizing, which can account for up to 
a 2x performance boost 

■ Kernel threaded functionality, for more granular control of 
parallelized processes, which can contribute to faster process 
execution 

■ NFS version 3 for greater scalability of networked file 
systems, and 

■ NIS+ for faster performance and greater security 

The table below compares the key performance attributes of HP-UX 
10.20 and 11.00. 



                                   10.20       11.00/32-bit   11.00/64-bit 

Performance and scalability 
Maximum CPUs supported             14          16             16 
32-bit PA-7x00-based systems       Yes         Yes            - 
64-bit PA-8x00-based systems       Yes         Yes            Yes 
V-2x00                             -           -              Yes 
File size, local                   128 GB      1 TB           1 TB 
File size, networked               2 GB        1 TB           1 TB 
Physical RAM                       3.75 GB     3.75 GB        4 TB 
Shared memory                      2.75 GB     2.75 GB        8 TB 
Process data space                 1.9 GB      1.9 GB         4 TB 
File descriptors                   60,000      60,000         60,000 
User IDs                           2 billion   2 billion      2 billion 
Kernel threads                     -           Yes            Yes 
Performance Optimized Page Size    -           PA-8x00 only   Yes 
NFS v3                             -           Yes            Yes 
NIS+                               -           Yes            Yes 

Notes: 
64-bit D-class support is expected to be available in 1998. 
Architected maximum; currently limited by system configuration to 32 GB. 


Resilience and high availability 




More and more organizations demand maximal uptime for their 
systems; the cost of down time can be exorbitant, in terms of 
lost revenues and/or customer dissatisfaction. HP offers the 
broadest portfolio of high-availability (HA) software and 
hardware products of the major UNIX vendors. HP's high- 
availability solution focuses on both increasing ROI and reducing 
the risks of loss of application availability and data 
corruption. 

HP's HA solution accomplishes these goals by clustering existing 
HP 9000 UNIX servers in end-users' IT infrastructure to provide 
application and data resiliency. A feature that is a standard 
part of HP-UX's robustness and memory fault tolerance is called 
Dynamic Memory Resilience (DMR) 7 , which tests continuously for 
faulty physical memory. It automatically prevents subsequent 
addressing, or use, of that failed RAM. The key benefit here is 
process execution protection and avoidance of data corruption. 



                                      10.20   11.00/32-bit   11.00/64-bit 

Resilience and high availability 
Journaled file system                 Yes     Yes            Yes 
64-bit journaled file system          -       -              Yes 
Dynamic Memory Resilience             Yes     Yes            Yes 
Fast memory dump                      -       Yes            Yes 
Dynamically Loadable Kernel Modules   -       Yes            Yes 


Both HP-UX 10.20 and 11.00 have journaled file systems and DMR, 
which contribute to reduced unplanned downtime. HP-UX 11.00 adds 
to these features Dynamically Loadable Kernel Modules, which 
provide an infrastructure to reduce planned downtime by enabling 
the modification of the kernel without taking the system down and 
rebooting. 

Complementing this HP-UX resilience functionality is HP's 
industry-leading clustering environment, MC/ServiceGuard, 
which employs an auto-reboot facility for loosely coupled 
servers. If one server that has been executing mission-critical 
applications fails, several other production servers 
within the MC/ServiceGuard environment can pick up the extra 
workload. Additionally, MC/ServiceGuard provides availability of 
mission-critical applications during hardware and software 
maintenance, thus minimizing the need for planned down 
time. For example, operating system or processor upgrades for a 
server within a cluster can be done without having to bring down 
the cluster. Workload can be shifted to other clustered servers 
during these operations. 

MC/ServiceGuard also monitors system processors, system memory, 
LAN media, LAN adapters, system processes and application 
processes, and it responds to failures in order to restore the 
application service to LAN-based clients in a matter of a few 
minutes. Other HA products from HP also ensure data 
availability, process management, on-line storage optimization, 
load balancing, etc. 

7 Earlier referred to as Memory Page Deallocation. 

HP's leadership in the UNIX clustered market is emphasized by 
services such as Critical Systems Support and Business Continuity 
Support services, which are tailored to customers' availability 
needs. From availability of inventory to preventative 
maintenance, these services provide the proper level of 
responsiveness for customers' computing environments. 


Integration 

In the open systems world, an important requirement is to provide 
enterprise-wide connectivity into all major environments. In 
addition, with the tremendous growth of the Internet and company 
intranets, it is imperative that an operating environment have 
integrated access to these important technologies to facilitate 
product and service advertising, information flow, and electronic 
commerce. HP-UX 11.00 is distinguished by its high degree of 
integration with the Internet, other operating environments, and 
such network operating systems as Windows NT®. 

HP-UX's high degree of integration is enabled through several 
means. 

• Internet and intranet connectivity*: HP-UX 11.00 incorporates a broad 
range of Internet services. These services enable any HP 9000 
Enterprise Server to operate in a TCP/IP-based 
Internet/intranet or other corporate environment that requires 
business use of distributed computing technology. 

• The HP Colliance Program enhances HP-UX's ability to coexist and 
integrate with mixed environments, ensuring that IT 
administrators can combine HP-UX with NT and NetWare across 
their organizations. 

• Netscape FastTrack*: Provides a complete solution for creating 
and managing Web sites on the Internet or an intranet 

• Oracle® Web Application Server* : Enhances Web servers with 
sophisticated features required for building and deploying 
real business applications on the Internet 

• HP-UX Java™ Virtual Machine, Just-in-time compiler for Java, Java SDK*: 
Allows developing and deploying platform-independent Java 
applications. 

* These are bundled for free with HP-UX. 

Security 

HP-UX 11.00 has a number of security features to give customers 
comfort that their systems are secure from external or 
internal tampering. (A complete list is included in the 
Appendix.) HP-UX conforms with US Department of Defense C2 
security requirements, to provide heightened server security, and 
includes system auditing, to improve user accountability and 
deter unauthorized activities. A B1 (i.e., trusted mode) version 
of HP-UX is also available. 

Manageability 

The system management facilities supported by HP-UX are designed 
to easily manage both single-server and complex networked 
systems, reducing total cost-of-ownership and increasing the 
productivity of your IT staff. 

• System Administration Manager (SAM): Provides easy, graphical-based 
management of system resources 

• Software Distributor: Provides robust, standards-based software and 
application distribution and updating 

• Ignite/UX: Enables customizable initial system configuration 


PRISM plus 

The PRISM framework discussed thus far captures five of the key 
groups of features and functionality of HP-UX 11.00. Additional 
attributes beyond those covered by PRISM include standards 
adoption, investment protection, application development, and 
breadth of applications portfolio. 

Breadth of applications portfolio 

HP-UX is the number one solutions platform in the UNIX market, 
with over 14,000 applications running on HP-UX developed by 
independent software vendors (ISVs). Customers can confidently 
choose HP systems, knowing that they have a breadth of business- 
critical applications available to apply to their business needs 
throughout the enterprise. It is impractical to show HP-UX's 
standing for all 14,000 of these applications. However, as the 
figure below shows, HP-UX is the #1 platform for Oracle and Informix 
by a substantial margin, and is the second-highest platform for 
Sybase's sales. 


[Figure: HP-UX is the Overall Leading Platform for the Top 3 Database Vendors.] 




HP-UX's status as platform revenue leader applies to many other 
leading ISV applications: SAP, Baan, PeopleSoft, D&B, and many 
others. Eight of the world's top ten ISVs have HP-UX as their #1 
revenue platform. 

Several factors contribute to HP-UX being the leading platform 
for most ISVs. First is HP's leading market share in the UNIX 
server market (recall the market share chart on page 2). ISVs, 
seeking to leverage their development efforts and costs, 
naturally gravitate toward the HP-UX platform and its large 
installed base, to maximize sales opportunities. 

Second, through its Developer Alliance Lab (DAL), HP has 
developed and maintains very strong and close relationships with 
larger ISVs. HP has similarly strong ties with other ISVs 
through the HP Partnership Program, which closely ties HP with 
resellers and systems integrators. These include equipment 
programs, technical support, and web-based information. HP 
shares with them early versions of HP-UX, so that they can best 
take advantage of new features and functionality in HP-UX and 
incorporate them into their applications early on: at or shortly 
after the release of HP-UX. Customers gain by having access to 
new application functionality sooner on HP-UX than on other UNIX 
platforms, which can give them a competitive advantage. 




Investment protection 

Investment protection has been a hallmark of HP-UX since it was 
first introduced in 1986. Over time, customers can spend up to 
twice as much on software as they do on hardware, in terms of 
people and dollars, and they want to be able to protect their 
substantial investments in applications software. HP-UX has the 
strongest record in the industry for providing application 
investment protection through forward binary compatibility, in 
which a fully-bound application developed on an earlier or 
current version of HP-UX is ensured to run smoothly on future 
versions of HP-UX. 

As a result, customers' investments in application software are 
protected by extending the useful life of their applications 
across several versions of HP-UX. HP ensures that 32-bit PA-RISC 
applications following well-documented programming 
techniques will run on IA-64-based platforms without requiring 
modification. 


[Figure: Forward Binary Compatibility Ensures Investment Protection Beyond 2000. Three panels: 1996 (32-bit applications on HP-UX (32), running on PA 7x00 (32) and PA 8x00 (64)); 1997 (coexisting 32- and 64-bit applications, including 64-bit RDBMS, on HP-UX (32) and HP-UX (64)); 1999 and beyond (32-bit, 64-bit, and optimized 64-bit applications on HP-UX (32) and HP-UX (64), running on PA 7x00 (32), PA 8x00 (64), and HP/Intel IA-64). Legend: object code compatibility; through well-defined interfaces (e.g. pipes, sockets, and shared memory).] 


Standards 

Openness and standards are key contributors for the growth of 
UNIX as a competitive alternative to proprietary mainframe 
computing environments. HP-UX leads the industry in the support 
and development of UNIX standards. It was the first UNIX to ship 
that is UNIX-95-branded 8 , as specified by The Open Group, the 
standards body 9 . HP-UX also complies with such industry 
standards as CDE, or Common Desktop Environment, HP-UX's 
windowing environment; POSIX; and over 70 others. A complete 
list is included in the Appendix. 


8 HP-UX 10.10, released in February, 1996. 

9 See http://www.xopen.org 




Beyond its commitment to standards adoption and compliance, by 
virtue of its leading presence in the UNIX server market, HP has 
played a significant role in the development of standards for 64- 
bit UNIX. For example, HP's Software Distributor technology was 
the basis of a standard recently endorsed by The Open Group and 
other leading UNIX vendors that permits software platform 
replication and standardization. 

HP-UX futures 

HP has ambitious plans to continue to enhance HP-UX. During the 
remainder of 1998 and through to the end of 1999, HP will 
increase the PRISM functionality of HP-UX as indicated in the 
table below. 


PRISM element: Performance and scalability 

  • Feature: Support for PA-8200 V2250 and PA-8500-based V2500 
    Benefit: Faster execution of mission-critical applications, with 
    particular benefits in data warehousing and decision-support 
    applications 

  • Feature: 64-bit D-class support 
    Benefit: Access to 64-bit computing at the low end; ideal for 
    low-cost 64-bit application development 

  • Feature: ccNUMA 
    Benefit: Ability to have a single instance of HP-UX apply to a 
    clustered SMP environment 

  • Feature: Memory windows 
    Benefit: Enables larger application-specific shared memory 

PRISM element: Resilience 

  • Feature: Dynamic Processor Resilience: rollover of applications 
    from a failed CPU in an SMP configuration 
    Benefit: Reduced unplanned down time 

  • Feature: Online addition and replacement of I/O 
    Benefit: Reduced unplanned down time 

PRISM element: Integration 

  • Feature: Java performance enhancements 
    Benefit: Faster execution of Java applications 

PRISM element: Security 

  • Feature: Journaled File System Access Control Lists 
    Benefit: Heightened security for systems configured with JFS 

PRISM element: Manageability 

  • Feature: Bundling of CA Unicenter Framework and HP Openview-ready 
    Benefit: Choice of industry-leading systems management frameworks 

  • Feature: Java-based systems management 
    Benefit: Easier systems management using a web browser interface 


As the introduction of IA-64-based systems approaches, HP will 
continue to invest in yet higher degrees of both performance and 
scalability and resilience and high availability. As to 
performance, plans are to support 32- and 64-way SMP systems, 
using both PA-RISC and IA-64 processors. To ensure increased 
system resilience and high availability, HP plans to incorporate 
in HP-UX protection domains and on-line replacement and addition 
of memory and CPUs, to complement HP's goal of providing 99.999% 
uptime, which translates to just five minutes of unplanned 
downtime per year. 
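
The five-minute figure can be checked with simple arithmetic. The sketch below is illustrative only: the class and method names are assumptions, and a 365-day year is assumed.

```java
// Illustrative only: UptimeBudget is a hypothetical helper, not an HP tool.
final class UptimeBudget {
    // Minutes of downtime per year permitted by an availability percentage,
    // assuming a 365-day year (525,600 minutes).
    static double downtimeMinutesPerYear(double availabilityPercent) {
        double minutesPerYear = 365.0 * 24.0 * 60.0;
        return minutesPerYear * (1.0 - availabilityPercent / 100.0);
    }
}
```

For 99.999% availability this gives 525,600 x 0.00001, or about 5.3 minutes per year, consistent with the figure quoted above.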


Endorsement by strategic partners 

The current consolidation of the UNIX market in anticipation of 
the introduction of IA-64-based systems has been well chronicled 
in the industry press. Last April, HP jointly announced that 
Hitachi, NEC, and Stratus— three prominent hardware vendors— had 
embraced HP-UX as their UNIX of choice for their IA-64-based 
systems. 

Concluding remarks 

The combination of HP's four-year strategic relationship with 
Intel in the joint development of IA-64, HP-UX's current and 
planned new functionality as described above, and the endorsement 
by major industry partners ensures that HP-UX will be the de facto 
UNIX on IA-64. For more information, see www.hp.com/ao/hp-ux . 




Appendix 


Feature comparison of HP-UX 10.20 and 11.00 



                                          10.20       11.00/32-bit      11.00/64-bit 

Performance and scalability 
Maximum CPUs supported                    14          16                16 
32-bit PA-7x00-based systems              Yes         Yes               - 
64-bit PA-8x00-based systems              Yes         Yes               Yes 
V-2x00                                    -           -                 Yes 
File size, local                          128 GB      1 TB              1 TB 
File size, networked                      2 GB        1 TB              1 TB 
Physical RAM                              3.75 GB     3.75 GB           4 TB 
Shared memory                             2.75 GB     2.75 GB           8 TB 
Process data space                        1.9 GB      1.9 GB            4 TB 
File descriptors                          60,000      60,000            60,000 
User IDs                                  2 billion   2 billion         2 billion 
Kernel threads                            -           Yes               Yes 
Performance Optimized Page Size           -           Yes, on PA-8x00   Yes 
NFS v3                                    -           Yes               Yes 
NIS+                                      -           Yes               Yes 

Resilience and high availability 
Journaled file system                     Yes         Yes               Yes 
Dynamic Memory Resilience                 Yes         Yes               Yes 
Dynamically Loadable Kernel Modules       -           Yes               Yes 

Integration 
Java VM with JIT compiler                 Yes         Yes               Yes 
Netscape FastTrack                        Yes         Yes               Yes 
Oracle Web Application Server             Yes         Yes               Yes 

Security 
Standard mode 
Object Reuse                              Yes         Yes               Yes 
HFS Access Control Lists                  Yes         Yes               Yes 
Restricted SAM                            Yes         Yes               Yes 
Large (>60000) User IDs                   Yes         Yes               Yes 
Kerberos V5 Authentication                Yes         Yes               Yes 
NIS Manageability                         Yes         Yes               Yes 
Pluggable Authentication Module           -           Yes               Yes 
NIS+ Manageability                        -           Yes               Yes 
Trusted mode 
Encrypted Password Protection             Yes         Yes               Yes 
Boot Authentication                       Yes         Yes               Yes 
Long Passwords                            Yes         Yes               Yes 

Notes: 
Except D-class 
Limited by system configuration to 16 GB 





                                          10.20       11.00/32-bit      11.00/64-bit 

Password Complexity Checking              Yes         Yes               Yes 
Password Reuse Checking                   -           Yes 1             Yes 1 
Password Life Cycle Management            Yes         Yes               Yes 
Login Controls                            Yes         Yes               Yes 
Auditing                                  Yes         Yes               Yes 
C2 Security Compliance                    Yes         Yes               Yes 
JFS Support in Trusted Mode               Yes         Yes               Yes 
NIS+ Manageability                        -           Yes               Yes 

Manageability 
SAM                                       Yes         Yes               Yes 
SAM enhancement for security and HA       -           Yes               Yes 

Applications 
32-bit only                               Yes         Yes               - 
32- and 64-bit                            -           Yes               Yes 

Standards 
AT&T System V Interface Definition        Yes         Yes               Yes 
  (SVID3 base and kernel extensions 
  subset) Level 1 API support 
Berkeley Internet Name Domain 4.9         -           Yes               Yes 
Common Desktop Environment with           -           Yes               Yes 
  integrated Technical Print services 
  (CDENext) 2.1 
Dynamic Host Configuration Protocol       Yes         Yes               Yes 
Federal Information Processing Standard   Yes         Yes               Yes 
  (FIPS) 151-2 (P1003.1 + Federal 
  interpretations) 
IEEE POSIX 1003.1c Kernel threads         -           Yes               Yes 
IEEE POSIX 1003.1 (POSIX system calls     Yes         Yes               Yes 
  and libraries) and 1003.2 (POSIX 
  commands) 
IEEE POSIX 1003.1b Real-time              Yes         Yes               Yes 
  scheduler, clocks and timers, 
  synchronous I/O, semaphores 
IEEE POSIX 1003.1b Asynchronous I/O,      -           Yes               Yes 
  Memory Locking, Message Queues, 
  Real-Time Signals, Shared Memory 
NFS+ Diskless and ONC+, NFS+, version     Yes         Yes               Yes 
  4.2 
OSF Application Environment               Yes         Yes               Yes 
  Specification (AES) 
OSF/Motif Toolkit (MotifNext)             Yes         Yes               Yes 
Streams-based transport stack for         -           Yes               Yes 
  TCP/IP 
System V.4 File System Directory Layout   Yes         Yes               Yes 
UC Berkeley Software Distribution 4.3     Yes         Yes               Yes 
  (BSD 4.3), including features such as 
  job control, fast file system, 
  symbolic links, long filenames, and 
  the C shell 





                                          10.20       11.00/32-bit      11.00/64-bit 

X/Open Portability Guide release 4 Base   Yes         Yes               Yes 
  Profile (XPG4) 
X/Open UNIX 95-branded (conforms with     Yes         Yes               Yes 
  the Single UNIX Specification, 
  SPEC1170) 
X11 R6.2 Window System, Font Server       -           Yes               Yes 
  and Clients 
Year-2000 compliant with patches          Yes         Yes               Yes 








Topics 




* WDM Principle 

* System Considerations 
  - Add/Drop Multiplexing 

* WDM Systems 

* Research Directions 


INTERNATIONAL 
TECHNICAL SUPPORT 


Harry J. R. Dutton 


COPYRIGHT IBM, JUNE 1998 


Optical Networking 

























Increase capacity of installed fibre 

-Needed in some applications 

Provide independence of separate 
channels 

-Clock synchronisation is a significant problem in 
Sonet/SDH systems 

-Many systems don't naturally fit into Sonet/SDH 

Low Cost 

-Especially operational cost 

High Reliability 

-Minimal electronics to go wrong 






Speed Limit? 


Fibre has NO practical capacity limit 


* BUT conventional transmission on
is limited

-2.5 Gbps is easy!

-10 Gbps is significantly more difficult

- Power required doubles every time speed is doubled

Real power limits exist

- Every time you double the modulation speed you quadruple
the effects of dispersion!

Dispersion is a constant regardless of speed so if you double the speed you double
the effect of dispersion

Signal bandwidth required increases with modulation speed

-But up to 40 Gbps possible with current techniques (this is
arguable)

- Electronics Capability and Cost an Issue








Fibre Amplifier - Principle 




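[Figure: an input signal (1500 nm band) and the output of a pump "power laser" (980 nm or 1480 nm) are combined in a coupler and passed through a section of fibre containing erbium dopant, which delivers the amplified output signal.]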






• Erbium Doped Fibre Amplifier

- Simpler than repeaters
- Lower in cost than repeaters
- Amplifies everything
    Not protocol dependent
    All wavelengths together
- Signal gain up to 25 dB
- No delay in amplification process
- Available "window" about 25 nm wide
    1535 nm to 1560 nm
    Some ways of extending to 40 nm

• Amplifying WDM signals requires careful amplifier design

- Require very flat response
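
The 25 dB gain figure can be read as a linear power ratio using the standard decibel definition. The following is a minimal sketch; the function names are illustrative and do not come from the slides.

```python
import math


def db_to_power_ratio(gain_db: float) -> float:
    """Convert a gain in decibels to a linear power ratio: 10 ** (dB / 10)."""
    return 10 ** (gain_db / 10)


def power_ratio_to_db(ratio: float) -> float:
    """Inverse conversion: linear power ratio back to decibels."""
    return 10 * math.log10(ratio)


ratio = db_to_power_ratio(25.0)
print(f"A 25 dB amplifier multiplies optical power by about {ratio:.0f}")
```

On this reading, a 25 dB amplifier raises the signal power by a factor of roughly 300.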






Types of WDM 




* Simple WDM Systems 

- Been around for several years 
-2 wavelengths only 

either 850 nm and 1310 nm 
or 1310 nm and 1550 nm 
sometimes (rarely) all three 

-Uses simple wavelength selective couplers 

* "Dense" WDM

-Multiple channels in the 1540-1560 band 
-This is what we are discussing here 




* Create separate signals at different wavelengths

* Mix them

* Transmit them along the fibre

* Separate them at the destination

* Receive them









System Considerations 




sating Wavebands 












Transmitters

* Always Lasers

* Need precise control of centre wavelength

* Narrow linewidth (to fit within band)

* Minimise CHIRP

* Minimise DRIFT

* Modulate at required speed

* Temperature control mandatory



Mixing The Signals (Multiplexing)

• Simple Couplers

- lose lots of light
- low cost
- probably need to amplify signal

• Wavelength selective devices

- Expensive BUT much more efficient
- Littrow Gratings
- Array Waveguide Gratings
- Dielectric Filters
- Circulators with Fibre Bragg Gratings








Transmission Issues

• Four Wave Mixing

• Stimulated Raman Scattering

• Self-Phase Modulation

- A high power effect

• Cross-Phase Modulation

- Another high power effect

• Stimulated Brillouin Scattering

- A problem with very narrow linewidths

• Dispersion

- Need dispersion to reduce the effects of FWM
- However it limits the distance you can go



Demultiplexing the Signal 




* Simple Couplers + Individual Filters 

- lose lots of light 

-probably need to amplify signal 

• Wavelength selective devices 

-Expensive BUT much more efficient 
-Littrow Gratings 
-Array Waveguide Gratings 
-Dielectric Filters 

-Circulators with Fibre Bragg Gratings 






Add-Drop Multiplexing 




WDM Systems 








Current WDM Record


NEC (1996)

"Hero" Demonstration

- Research demo - NOT operational system

132 channels at 20 Gbps = 2.64 Tbps
- 0.25 nm (actually 33.3 GHz) channel spacing
- 35 nm wavelength range (1529-1564 nm)

120 km span

Conventional fibre
- plus dispersion compensating fibre and loss
compensating amplifiers
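
The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes the band centre is the midpoint of the quoted 1529-1564 nm range and uses the standard relation between frequency spacing and wavelength spacing, d_lambda = lambda^2 * d_f / c. The function and variable names are illustrative.

```python
C = 299_792_458.0  # speed of light in vacuum, m/s


def spacing_nm(freq_spacing_hz: float, centre_wavelength_nm: float) -> float:
    """Wavelength spacing (nm) equivalent to a frequency spacing at a given centre wavelength."""
    lam_m = centre_wavelength_nm * 1e-9
    return lam_m ** 2 * freq_spacing_hz / C * 1e9


channels = 132
rate_gbps = 20
centre_nm = (1529 + 1564) / 2  # assumed band centre

aggregate_tbps = channels * rate_gbps / 1000
d_lambda = spacing_nm(33.3e9, centre_nm)
occupied_nm = channels * d_lambda

print(f"aggregate capacity: {aggregate_tbps:.2f} Tbps")
print(f"channel spacing:    {d_lambda:.3f} nm")
print(f"occupied band:      {occupied_nm:.1f} nm")
```

Under these assumptions the 33.3 GHz spacing works out to a little over a quarter of a nanometre, and 132 such channels fill approximately the 35 nm range quoted.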




• Fujitsu (1996)

- 55 wavelengths, 20 Gbps, 150 km, 3 spans

• AT&T/Lucent (1996)

- 25 wavelengths, 40 Gbps, 55 km

• NTT (1996)

- 10 wavelengths, 100 Gbps OTDM, 40 km

• Hitachi (1995)

- 8 wavelengths, 10 Gbps, 1171 km, standard fibre




Optical Wavelength Routing


Immediate Challenges

- Must use a single wavelength end-to-end
- NOT like ATM where you change VPI/VCI
on each hop

- Gets very complex in a large network
- Probably need to be able to tune both transmitters
and receivers

- But possible with amplifiers and moveable gratings
at crosspoints





















Research Challenges 




• COST 

-Should come down in volume production - but... 

• Tuneable Laser Transmitters 

- Recent reports suggest this is solved 

• Control/Stability of laser wavelength

- In manufacture - Solved 

• Speed of tuning of tunable receivers 

-Mechanical actions slow (circa 250 micro sec) 


* Erbium Amplifiers 

-Width of amplified band, response over band, 
crosstalk 








ENTERPRISE MIDDLEWARE SERIES 



Making 

Component-based 
Systems Scale 
with BEA M3 

May 1998 



T H E 

Enterprise 

Middleware 

Solution 


Building High Volume Distributed Applications using CORBA 






Contents 


Acknowledgement
Introduction
Client/Server Connectivity
    Connectivity Alternatives
    Connection Multiplexing
Efficient State Management
    BEA M3 TP Framework
    State Management Policies
Operational Systems Management
BEA M3 Has the Right Stuff
Bibliography
















Acknowledgement 


This paper is based on an article by the author previously published in Volume 12, Report 2 of
Middleware Spectra (May 1998) under the title, “Issues when making object middleware scalable.”
It has been adapted to show how the BEA M3 product uses the techniques described in that article to
achieve scalability.


Copyright 1998 by BEA Systems, Inc. (BEA), 385 Moffett Park Drive, Sunnyvale, CA 94089-1208, 
U.S.A. 

BEA is a registered trademark of BEA Systems, Inc. M3, Jolt, and BEA Manager are trademarks of 
BEA Systems, Inc. TUXEDO is a registered trademark in the U.S. and other countries. All other 
company names may be trademarks of their respective companies with which they are associated. 

All rights reserved. No part of this document may be reproduced, photocopied, stored on a retrieval 
system, or transmitted without the express prior written consent of the publisher. All specifications 
subject to change without notice. 






1.0 Introduction 


The inherent characteristics of object technology naturally lend themselves to the component-based 
design paradigm. For object technology to move from experimentation by advanced development 
groups to the mainstream of mission-critical application development, scalability issues must be 
addressed. Algorithms that worked fine for team development and prototyping often fall short when 
subjected to the demands of hundreds or even thousands of users. Three areas are particularly critical. 

1. Client/Server Connectivity 

The most straightforward implementation of an Object Server connects every client directly to
every server it must interact with. As the number of clients and servers increases, the number of
connections grows as the product of the two, exceeding the capacity of the largest processors.

2. Efficient State Management 

Increases in both the client population and system workload cause the number of active objects to
increase and their demands on system memory to exceed the memory capacity of even the largest
servers. To overcome this phenomenon requires different strategies for managing memory utilization.

3. Operational Systems Management 

In the development environment, the programmer can easily deal with the administration requirements
of today’s object systems. When the application is ultimately deployed, the operations staff,
not the developer, must be able to manage it in a dynamically changing environment, simply and
efficiently.

This collection of issues is not new. Commercial TP monitors like BEA TUXEDO® have wrestled 
with these same problems for decades and today offer robust, mature technology which has proven 
effective for customers deploying large mission-critical applications. BEA has built BEA M3 on the 
same mature infrastructure that powers BEA TUXEDO. Thus, BEA M3™ adds the object paradigm 
to these proven technologies to create the next generation of component middleware, making it 
uniquely suited to become the mainstream application platform of the next decade. 


2.0 Client/Server Connectivity

Any distributed system needs to be able to deliver client requests to the server that will process them.
This means using the communication subsystem and establishing the necessary connections to
transport these requests. Two techniques for managing these connections are described in this paper:
direct connections and session concentrators.




• With direct connections each client establishes a connection to the server which hosts the object it 
needs to communicate with. If one already exists, it may reuse it. 

You can think of this as the analogue of a direct telephone line between you and someone you 
wish to call. 

• With session concentrators, each client connects to a single gateway; the gateway then connects to 
each server. 

Using the telephone analogy again, this is like connecting your phone to the local exchange in 
order to place calls to anyone else connected to the phone system. 

2.1 Connectivity Alternatives 

Standard ORBs make a connection directly from each client to each object they use. While this may
appear optimal at first, it consumes expensive resources as the number of clients (M) and servers
(N) increases (telephone numbers in the telephone analogy) and thus runs out of capacity quickly
(M x N gets very large as M and N get large). Typically, direct connections have very low utilization,
since clients invoke methods on objects in the same server very infrequently, especially if client
requests are driven by human interactions. By analogy, you use your phone to talk to
multiple people in the course of the business day.


FIGURE 1. ORB Connectivity Architectures (panels: Standard ORB; BEA M3 Component System)









BEA M3 uses session concentrator technology and manages object interfaces as server classes. 

Servers are pre-started in BEA M3 to ensure that they are always reachable from the concentrator. 
Then, when requests arrive at the concentrator for a particular object, they are routed to the proper 
server. Completing the telephone analogy, by routing all calls through your local exchange, an 
optimal path to the telephone you are calling can be selected enabling the phone at the other end to 
ring. The cost of the concentrator (i.e. the local exchange) is more than compensated for in increased 
throughput between client and server (fewer physical phone lines are required, utilization is higher, 
and your phone can be connected to more destinations). 

Why do standard ORBs use the direct connection method? The Interoperable Object Reference
(IOR) defined as part of the CORBA 2.0 specification contains the IP address of the target object.
Each time a new object reference is created, the client receives a new IOR with a new IP address.
Today’s ORBs establish (or in the best case re-use) a direct connection between the client and the
target object to transport a method request. This leads to a large number of connections and constrains
the number of clients each server can support, limiting scalability well before processing
capacity is exceeded.

2.2 Connection Multiplexing 

BEA M3 is the only object server to use the session concentrator model for connectivity. Borrowing 
from similar technologies provided by the workstation handlers for BEA TUXEDO and BEA Jolt™,
BEA M3 introduces a new IIOP gateway for CORBA clients. Clients are connected to this gateway, 
and not the ultimate target object. The BEA M3 object servers are also connected to the gateway 
through a messaging backbone, so the server connections can be shared among all clients. This 
reduces the M x N connections required by standard ORBs to M+N connections in BEA M3. Because 
fewer connections are required for the same number of clients, the BEA M3 system supports more 
users and a higher throughput rate. 
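
The difference between the two connection models is easy to see with a short calculation. The sketch below assumes the worst case for direct connections, in which every client eventually talks to every server, and a single gateway for the concentrator case. The function names are illustrative.

```python
def direct_connections(clients: int, servers: int) -> int:
    """Worst case for direct connections: every client connects to every server (M x N)."""
    return clients * servers


def concentrator_connections(clients: int, servers: int) -> int:
    """One connection per client plus one per server through a single gateway (M + N)."""
    return clients + servers


for m, n in [(10, 5), (1_000, 50), (10_000, 200)]:
    print(
        f"M={m:>6} N={n:>4}  direct={direct_connections(m, n):>9}  "
        f"concentrator={concentrator_connections(m, n):>6}"
    )
```

With 10,000 clients and 200 servers, the direct model needs up to two million connections while the concentrator model needs about ten thousand.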



FIGURE 2. Connection Multiplexing in BEA M3 




By routing all client requests through the session concentrator, BEA M3 can balance the load between 
multiple equivalent servers. This eliminates bottlenecks and provides a uniform distribution of 
workload. Availability is also improved since failures can be bypassed and requests routed to an 
available server. When the failure is repaired, the server is automatically restarted and again becomes a 
candidate for processing client requests. So, even though the gateway appears at first to introduce 
additional cost, it actually improves both throughput and availability. 
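
The routing behavior described above can be illustrated with a small model. This is only a sketch: the paper does not say which balancing algorithm BEA M3 uses, so round-robin is an assumption, and the class and method names (`Concentrator`, `route`, `mark_failed`, `mark_restarted`) are hypothetical.

```python
class Concentrator:
    """Toy session concentrator that spreads requests over equivalent servers."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.available = set(self.servers)
        self._next = 0  # index of the next server to try (round-robin, an assumption)

    def mark_failed(self, server):
        """Take a failed server out of the routing pool."""
        self.available.discard(server)

    def mark_restarted(self, server):
        """Return a repaired server to the routing pool."""
        if server in self.servers:
            self.available.add(server)

    def route(self):
        """Return the next available server, skipping any that are marked failed."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if server in self.available:
                return server
        raise RuntimeError("no server available")


gateway = Concentrator(["s1", "s2", "s3"])
gateway.mark_failed("s2")
print([gateway.route() for _ in range(4)])  # requests bypass s2
gateway.mark_restarted("s2")
print([gateway.route() for _ in range(3)])  # s2 is a candidate again
```

Because clients only ever see the gateway, a failed server can be bypassed and later reinstated without any change on the client side.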

Use of session concentration technology by BEA M3 enhances scalability by separating two important
tuning parameters: increases in the client population and increases in the transaction rate. Client
population increases are accommodated by adding additional IIOP gateways. Transaction rate increases
are accommodated by increasing the number of BEA M3 object servers, which is done automatically
as the workload increases. The result is that the resources required now grow linearly with
each tuning parameter rather than with their product.


3.0 Efficient State Management 


The principal difference between a traditional TP server application and an object is the presence of
state. State is what makes objects fundamentally different from today’s procedural applications.
Today’s object servers by default keep the object’s state in memory from the time the object is activated
until the server application explicitly destroys it. When large numbers of clients access large
numbers of server objects, memory gets full, the system starts to page, and response time deteriorates.
The number of clients actually making method calls at any point in time is typically quite small,
allowing BEA M3 to manage server memory more efficiently. This is accomplished by an important
component of BEA M3 - the TP Framework.

3.1 BEA M3 TP Framework 

The TP Framework of BEA M3 is an example of a server framework and provides a higher level
abstraction layer which sits between BEA M3’s server ORB, its object adaptor, and the user application.
The TP Framework can be thought of as a container within which the application executes. The
container provides both services used by the application and framework-initiated callbacks to the
application. These callbacks provide plug-ins to handle the necessary systems functions: retrieving
object state from persistent storage and saving object state back to persistent storage. Using a framework
to implement these capabilities allows them to be coded separately from normal business functions.







FIGURE 3. The BEA M3 TP Framework (diagram label: ORB Core)

A particularly important function of the BEA M3 TP Framework is the activation and deactivation of
object instances. Using the function defined by the OMG’s Portable Object Adaptor (POA), the
framework calls back to the application to save and restore its state when the instance would otherwise
be idle. This frees up memory for use by other objects actually servicing method calls, allowing
the system to scale beyond the memory requirements of all “active” objects.

These state management functions happen within the BEA M3 object’s implementation and therefore 
are transparent to the client. The client makes requests as if the object were always in memory. If the 
object is not currently resident, the BEA M3 TP Framework calls the user’s application code to 
activate it on demand to respond to the client’s request. 

3.2 State Management Policies 

The BEA M3 TP Framework supports multiple state management policies. 

• A method-level policy which activates the object before every method request and deactivates it 
when each method terminates. 

This closely mimics a request/response service. 

• A transaction-level policy which activates the object when a transaction begins, leaves the object 
in memory for subsequent interactions, and then deactivates the object when the transaction ends. 

This is similar to a conversational service. 









• A process-level policy which activates the object on first request and deactivates it at normal
process termination or when the application requests deactivation.

This is equivalent to the normal CORBA object behavior provided by a traditional ORB product.

3.2.1 Method Level Activation

The simplest is the method-level policy. With method-level state management, the object’s state is 
read from external storage each time the method is dispatched and written to external storage when 
the method completes. The application developer cooperates with the framework by implementing 
callback methods to save and restore the object’s state. 


This behavior is almost identical to stateless services with one important difference: the object is
completely stateful. For the cost of the two framework calls to save and restore state, the object’s 
memory can be reclaimed by the BEA M3 system for use by other objects when it is inactive. The 
explicit state management of the BEA M3 TP Framework frees the developer from the requirement to 
write state management as part of the application and improves overall performance. 
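
The method-level policy can be sketched as a small dispatch loop. This is a simplified model, not the BEA M3 API: the callback names `restore_state` and `save_state`, the `MethodLevelFramework` class, and the use of a dictionary in place of external storage are all assumptions made for illustration.

```python
class Account:
    """Hypothetical application object; the balance is its only state."""

    def __init__(self):
        self.balance = 0

    # Callbacks invoked by the framework (names are illustrative).
    def restore_state(self, saved):
        self.balance = saved.get("balance", 0)

    def save_state(self):
        return {"balance": self.balance}

    # Business method, written without any state management code.
    def deposit(self, amount):
        self.balance += amount
        return self.balance


class MethodLevelFramework:
    """Activates an object before each method call and deactivates it afterwards."""

    def __init__(self, storage):
        self.storage = storage  # stands in for external persistent storage
        self.resident = 0       # instances currently held in memory

    def invoke(self, cls, object_id, method, *args):
        instance = cls()
        self.resident += 1
        instance.restore_state(self.storage.get(object_id, {}))
        try:
            return getattr(instance, method)(*args)
        finally:
            # State is written back and the instance is dropped after every call.
            self.storage[object_id] = instance.save_state()
            self.resident -= 1


storage = {}
framework = MethodLevelFramework(storage)
framework.invoke(Account, "acct-1", "deposit", 50)
framework.invoke(Account, "acct-1", "deposit", 25)
print(storage["acct-1"], "resident instances:", framework.resident)
```

The client sees a stateful object across both calls, yet no instance stays in memory between them.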



FIGURE 4. Method Level Activation 















In Figure 4, the transaction outcome is commit, so the final version of the object state is made
durable as part of the transaction. If the transaction were to roll back and the state was written to a
transactional storage medium (e.g. a database), the final object state would be equal to the initial
object state.

3.2.2 Transaction Level Activation 

Many server applications use transactions to coordinate updates to shared databases. Although the
duration of a transaction can be a single request, as in the example above, many transactions require 
multiple interactions between client and server before changes can be committed. Such an application 
is a good candidate for a transaction-level activation policy. 



FIGURE 5. Transaction-Level Activation 









Figure 5 outlines the flow for a transaction-level activation policy. The object is activated when the
transaction is started and the first method call is made. It remains active until the transaction terminates
and commit processing is complete. If the transaction outcome is rollback (and the storage
medium is transactional), the final object state is equal to the initial object state.

This behavior is similar to a conversational service, except that it does not require a dedicated process.
This is made possible since all of the object’s state is encapsulated within the object.

An important property of the CORBA object model is the separation of interfaces from implementations.
The interface (embodied in the object reference) enables the client to make requests without
being aware of the existence of an active implementation. By controlling the residency of active
implementations, the BEA M3 TP Framework can manage memory more efficiently, enabling more
objects to be active than can otherwise be supported. This enhances BEA M3’s scalability well
beyond what is possible with the default CORBA activation model.
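
The transaction-level policy can be modelled in the same spirit. In this sketch an object is activated on its first method call inside a transaction, stays resident for later calls, and is deactivated when the transaction ends. Writing state only at commit stands in for a transactional storage medium. All names (`Counter`, `TransactionLevelFramework`, `restore_state`, `save_state`) are hypothetical and not taken from the product.

```python
class Counter:
    """Hypothetical application object with a single integer of state."""

    def __init__(self):
        self.value = 0

    def restore_state(self, saved):
        self.value = saved.get("value", 0)

    def save_state(self):
        return {"value": self.value}

    def add(self, n):
        self.value += n
        return self.value


class TransactionLevelFramework:
    """Keeps objects resident for the life of one transaction."""

    def __init__(self, storage):
        self.storage = storage  # stands in for a transactional store
        self.active = {}        # object_id -> instance resident in this transaction

    def begin(self):
        self.active = {}

    def invoke(self, cls, object_id, method, *args):
        instance = self.active.get(object_id)
        if instance is None:  # first call in this transaction: activate
            instance = cls()
            instance.restore_state(dict(self.storage.get(object_id, {})))
            self.active[object_id] = instance
        return getattr(instance, method)(*args)

    def commit(self):
        # Make the final state durable, then deactivate everything.
        for object_id, instance in self.active.items():
            self.storage[object_id] = instance.save_state()
        self.active = {}

    def rollback(self):
        # Discard in-memory changes; stored state stays at its initial value.
        self.active = {}


store = {"c": {"value": 1}}
tx = TransactionLevelFramework(store)
tx.begin()
tx.invoke(Counter, "c", "add", 2)
tx.invoke(Counter, "c", "add", 3)
tx.rollback()
print(store["c"])  # unchanged by the rolled-back transaction
```

After a rollback the stored state equals the initial state, matching the outcome described for Figure 5.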


4.0 Operational Systems Management 


The early stages of CORBA development have focused on the services and functions needed by the 
application programmer. To date, few standards have been adopted to support management once the 
application is ready for deployment. And even those that have been adopted are just now beginning to 
appear in commercial ORB products. Successful deployment of these applications requires that 
management software be available to meet the needs of the operations staff, not the programmer. 

Without a management infrastructure available to all applications, this critical issue has been left in
the hands of the application developer. Many object-oriented projects in large enterprises have failed
in deployment because this very important need was not recognized during the development cycle.
Perhaps even worse, successful projects have been forced to develop their own proprietary management
solutions in order to get their applications into production, compromising the advantages of
building applications to an otherwise industry standard set of interfaces.

Few (if any) enterprises are willing to deploy applications in production which cannot be managed by
the operations staff. BEA M3 leverages the proven management infrastructure of BEA TUXEDO,
freeing the application programmer from the need to produce object management. BEA M3’s management
capabilities include a Web-based GUI management tool, a programming interface for
management applications, and integration (via BEA Manager™) with the popular systems management
consoles (OpenView, Unicenter, Tivoli, etc.).


5.0 BEA M3 Has the Right Stuff

Many of the shortcomings of today’s ORB products are, in fact, the strengths of today’s TP monitors.
This is the key to the BEA M3 design and has been recognized by many industry watchers including
the Gartner Group. The phrase, “Object Transaction Manager (OTM)” was coined by Gartner to refer
to the next generation of middleware. Gartner expects OTMs to be the mainstream deployment
environment for enterprise business applications in the next decade. OTMs “combine the best features
of TP monitors with the object programming paradigm.”

The CORBA programming model offers the application developer a simpler way of building business 
applications. With widespread acceptance, it offers the developer a broader selection of tools, making 
it even easier to produce business applications. To that end, work is in progress within the OMG to 
add a component model to CORBA, compatible with Enterprise Java Beans, making programming by 
assembly a reality. 

While under development, BEA M3 was known as Iceberg because of its emphasis on the mission-critical
infrastructure. Just like an iceberg, most of BEA M3’s value is below the water line, i.e.
unseen by the application programmer as an API, but essential if mission-critical applications are to 
be deployed in the real world. BEA M3 is much more than an ORB plus the CORBA transaction 
service (OTS). 

• It incorporates connection multiplexing and object state management to fully realize the benefits 
of the object model in a TP environment. 

• It brings scalability and manageability beyond today’s ORB environments. 

• It makes it possible to build and deploy CORBA applications in the most demanding production 
environments. 

BEA M3 will finally translate the promise of object technology into measurable business benefits. 





FIGURE 6. An ORB plus OTS is just the tip of the Iceberg 


Industry standards and specifications such as CORBA, EJB, and ActiveX play an important role in 
the industry but to build an enterprise system, they must be able to work together. BEA M3 pulls 
them together and adds the mission-critical application infrastructure proven inside BEA TUXEDO. 
The result is the industry’s first scalable object server and the first mission-critical OTM. 

In my 1995 article in Object Magazine, “TP Monitors and ORBs: A Superior Client/Server Alternative,”
I predicted that, “the marriage of ORB and TP monitor technologies will create a new infrastructure,
exploiting the strengths of TP monitors, while providing a robust application development environment
based on reusable objects.” With the availability of BEA M3, we are about to enter the era of
the OTM and component-based enterprise transaction systems.






6.0 Bibliography 


1. E. Cobb, The Impact of Object Technology on Commercial Transaction Processing, VLDB Journal,
1997, Volume 6 Number 3, pp. 173-190.

2. E. Cobb, TP Monitors and ORBs: A Superior Client/Server Alternative, Object Magazine,
February 1995.

3. Gartner Group, Object Transaction Monitors: The Foundation for a Component-based Enterprise,
August 1997.

4. J. Gray, A. Reuter, Transaction Processing: Concepts and Techniques, Morgan Kaufmann Publishers,
1993.

5. R. Orfali, D. Harkey, J. Edwards, Instant CORBA, John Wiley, 1997.




AUUG 


September 1998 


Samba 

John H Terpstra, Aquasoft Pty Ltd (and Samba-Team) 


Abstract: 

This paper provides background information
about Samba, its use, features, development,
and about the people behind the scenes. The 
success of a complex and diverse product like 
Samba demands acknowledgment of current 
issues in the information technology industry 
as well as a moment’s reflection upon the 
strengths and weaknesses of Samba. A brief 
review of the development direction is also 
provided. 

Discussion: 

Some wish for a static world. One that would 
give us time to catch up, to understand a 
greater portion than the small fraction of 
knowledge and understanding we now possess.
Surely, this is not the world we know.
Almost as soon as we begin to comprehend 
the rules and issues before us we notice yet 
more to conquer and behold. 

Andrew Tridgell¹, author of Samba, never
imagined from the outset that by September
1998 there might be a usage base of Samba as
large as it is. Neither is it really of any concern
today amongst the Samba-Team whether
the installed systems base is 1,000 units or 
2,500,000 units. For those who insist on an 
answer, it is probably somewhere in between 
those numbers. 

What matters is that the Samba-Team has
arisen out of an earnest interest in understanding
Server Message Block (SMB) technology,
which is used in many networking products. Of
these, the most prevalent products utilising
SMB are the Microsoft Windows™ family of
products.

When Andrew first started to produce an
SMB server he did not realise what the protocol
was called. Simply from sniffing a network
he started to decode how the SMB protocols
work. Soon these observations were put
into a sample implementation. Ten or more
years later the result is a product that has
received contributions from well over 2,000
people. Information technology people find
themselves drawn to its use and development
in a compulsive way.

About 3 years ago, the load imposed by e-mail
responses to Samba became so great that
Andrew could find insufficient time to work
on furthering code and protocol development.
By that time a number of other people had
been drawn towards Samba and had started to
offer to take over some of Andrew’s work in
responding to e-mail, releasing him to put
more effort into development of
new code. Gradually the word was put out
that help was needed or else Samba development
would cease.

The Samba-Team: 

A group of four people flew and drove to
Canberra, Australia, and met at Andrew’s
home to discuss how a group could be organised
to help in assuring the continued development
of Samba. At the same time, it was
recognised that consideration of user base
needs should be at the forefront of our thinking.

The result is a loosely affiliated group of 
people who share a common interest and who 
dedicate much time, effort and expense to a 
product that many have grown to like and use. 

The Samba-Team is globally dispersed, including
five Australians, two people in London,
one in Germany, one in France, four in
America, and more who drop in for a specific
task and then settle into their routine interests
again. The key team members include Andrew
Tridgell, Jeremy Allison, Luke Leighton
and John Terpstra. It would be an injustice
not to mention the others and at the same time
an injustice to do so, since it will not be
possible to mention all key contributions to
the success of the Samba project.




Paul Ashton, Paul Blackman, John D Blair,
Christopher Hertel, Richard Sharpe, Dan
Shearer, Dave Fenwick, Simon Hyde and others
have all left their indelible mark upon the
Samba landscape.

Paul Blackman has done a stunning job managing
the Samba web page
(http://samba.anu.edu.au/samba). Various
Samba-Team members help to manage the
FTP site (now mirrored to over 80 sites
world-wide).

Samba Support Facilities: 

The Samba-Team manages a number of mailing
lists, each to cater for a special interest focus.
These lists are:

Samba - general discussion and help 

Samba-announce - the title says it all 

Samba-ntdom - for those working on and interested
in the development of Windows NT
Domain Control technology.

Samba-vms - discussion and help for Digital 
VMS users. 

Samba-cvs - for those interested in CVS
(Concurrent Versions System) commit messages.
These can be entertaining at times, particularly
when those committing code forget that
the world is watching every comment!

Samba-docs - discussion of documentation 

Samba-technical - the propeller-head channel 

Samba-binaries - a communications channel 
for those who prepare and contribute Samba 
binary packages. 

The global network of Samba FTP sites carries
all Samba source code releases, binary
packages for all major platforms, contributed 
tools and utilities, and much more. 

The mailing lists are archived on the Samba 
server (samba.anu.edu.au) and are fully 
searchable. The knowledge bank that exists 
here is quite amazing. 


Bug Tracking: 

In addition there is an e-mail facility for
submission of bug reports and patches, and for
those needing to contact the Samba-Team. This
channel is available at:
Samba-bugs@samba.anu.edu.au. In the past 12
months Samba-bugs has received and handled
over 10,000 messages and has processed
many times that when all replies and follow-up
messages are included.

In August 1997 the volume of mail on the
Samba-bugs channel became so large that a
formalised bug tracking system was implemented.
Andrew Tridgell, with help from others,
produced the Jitterbug web based bug
tracking system that remains in use. This tool
has made it possible to use a globally distributed
network of support volunteers. Other
Open Source development teams now also
utilise Jitterbug for bug reports, patch support
and general team coordination.
These include Linux (Linus Torvalds) and the
Gnome Team.

The key Samba bug report support engineers 
are Jeremy Allison, Simon Hyde and John 
Terpstra. Several others have also provided 
valuable services to help manage the ever 
growing volume of messages. 

Recently, we instituted a system of automatically
responding to all requests by return
mailing of the most current frequently asked
questions and answers. This way our users do
not have to suffer unnecessary delays until a
human can get around to answering their
concerns.


Month                  Bug Reports

Aug, Sept, Oct 97      1400
Nov 97                 710
Dec 97                 710
Jan 98                 950
Feb 98                 1150
Mar 98                 1000
Apr 98                 1270
May 98                 735
June 98                625
Jul 98                 710
Aug 98                 approx 1200




The upsurge in bug reports in February 
through April 1998 was the result of release 
of Samba version 1.9.18 and consisted largely 
of requests from new users. 
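
As a rough cross-check, the monthly figures in the table above can be totalled and compared with the "over 10,000 messages" quoted for Samba-bugs. This is only a sketch: the August 1998 figure is approximate in the source, the table counts bug reports rather than all messages, and the dictionary keys are our own labels.

```python
# Bug report counts as listed in the table (Aug 98 is given as "approx 1200").
bug_reports = {
    "Aug-Oct 97": 1400,   # one combined figure for three months
    "Nov 97": 710,
    "Dec 97": 710,
    "Jan 98": 950,
    "Feb 98": 1150,
    "Mar 98": 1000,
    "Apr 98": 1270,
    "May 98": 735,
    "Jun 98": 625,
    "Jul 98": 710,
    "Aug 98": 1200,
}

# Total over the whole period covered by the table.
total = sum(bug_reports.values())

# Reports in the months attributed to the Samba 1.9.18 release upsurge.
upsurge = sum(bug_reports[m] for m in ("Feb 98", "Mar 98", "Apr 98"))

print(total)
print(upsurge)
```

On these figures the February to April upsurge accounts for roughly a third of all reports in the table.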

About half of the messages received are from
new installations; the remainder arise out of
site update problems, either with Samba or
with new MS Windows clients.

There were a very significant number of bug
reports resulting from Microsoft’s removal of
support for plain text password handling with
the release of MS Windows NT 4.0 Service
Pack 3. The recent release of MS Windows
98 has resulted in a similar number of URGENT
requests for help, as Windows 98 by
default does not support plain text passwords.


A Brief History: 

A surprising proportion of the current installed
base are still using versions that are
older than 2 years (approximately 25%).

The older versions still in common use are: 
1.9.15p8, 1.9.16p6, 1.9.16p9, 1.9.17p5. It 
would be foolish to even begin to estimate a 
percentage of these. 

As an indicator of development activity the 
following list shows the number of updates 
(patches) issued for each of the recent major 
versions: 


Version     Updates

1.9.15      8
1.9.16      11
1.9.17      5
1.9.18      10


The figures shown are the final numbers for 
production quality stable releases. Updates 
have been issued for security fixes, feature 
and usability enhancements, and for extension 
of support for platforms to which Samba has 
been ported. 


Samba Version        Release Date

1.9.17p4             Oct 22, 97
1.9.17p5             Dec 20, 97
1.9.18               Jan 8, 98
1.9.18p1             Jan 13, 98
1.9.18p2             Jan 27, 98
1.9.18p3             Feb 19, 98
1.9.18p4             Mar 28, 98
1.9.18p5             May 5, 98
1.9.18p6             May 11, 98
1.9.18p7             May 13, 98
1.9.18p8             Jun 13, 98
1.9.18p9 (killed)    Aug 13, 98
1.9.18p10            Aug 2X, 98


Basic support for and compatibility with Microsoft
clients became quite stable with the
release of Samba-1.9.15p8. The 1.9.16 series
was used to provide more comprehensive protocol
support for MS Windows NT and 95.

The 1.9.17 series introduced stable support
for encrypted passwords (but required the
DES library at compilation time), brought
about compliance with the Common Internet
File System (CIFS) standards, and was used
to introduce a radically re-engineered browsing
technology as well as to launch redesigned
file locking support.

The Samba-1.9.18 series introduced MS Windows
NT compatible Opportunistic File locking,
embedded encrypted password support,
multi-national character set support and many
other new features.

From inception to the end of the 1.9.18 series
all code development has been progressive
and incremental. When re-engineering has
been necessary the principle has been followed
that only essential change can be
implemented, and all code has been subjected
to extensive peer review before a code update
was accepted.

One of the best decisions made has been the
inclusion of a Roadmap with the Samba
source package since 1.9.17. We have surprised
ourselves that the vision has been so
clear. Even more surprising is the extent to
which the vision was our guide! Our user base
noticed this document and we received much
applause for the bravery (or fool-hardiness, as
the day may have it) of including such a thing
in an Open Source package. After all, common
logic will have us believe that this level
of planning is alien to rapid development. We
justified the Roadmap to our programmers by
telling them its purpose was to pacify bug
reporters. Will the programmers ever trust us
again?


Current Development: 

For over six months now, many people have
contributed to the development of the next
generation of Samba. To date this has been
known variously as the NTDOM code tree and
as the Samba-1.9.19 development stream.
There is not likely to be a
1.9.19 release because by general consensus
the changes brought about in this stream have
been too extensive and radical. Out of respect
for our user base we want to reinforce the fact
that the greater portion of the new code has
been completely rewritten. This means that all
user sites must exercise caution in its
deployment.

We are however confident that the new code 
will stabilise much more rapidly than in the 
past. The code has been migrated to the use of 
autoconf to automate the pre-compilation 
configuration of Samba. This means that 
Samba should perform far better on a wider 
range of platforms than in the past. 

Many previously hard coded parameters are 
now either auto-tuning, or are configurable 
via the smb.conf file. 

The next version (probably Samba-2.0) will
feature the Samba Web Administration Tool
(SWAT), Windows NT Domain client support,
and limited ability to act as an MS Windows
NT Domain controller.

For the curious and more technically minded, 
Samba-2.0 will provide: 

1) Blocking byte-range locking support, for
improved file locking. Previously a lock that
could not be honored would result in an error
message being returned to the client that
could cause ill behaved client software to fail.
(Done).

2) Implementation of Change Notify functionality.
In the past client software had to poll for
file content changes. Now Samba will be able
to notify the client that a file or record has
changed. (Done).

3) MS Windows NT SMB function calls. This
will reduce the number of file access failures
caused by buggy MS Windows applications
that failed to respect correct file system access
messaging. In other words, we will at last
be more completely MS BUG compatible.
(Done - but work in progress)

4) The number of open files has been lifted 
from 100 to 4096 per client connection, and it 
is auto-tuning. (Done) 

5) We are hoping to implement codepage
(multi-national character set) support for NetBIOS
names (computer name as well as the
workgroup or domain name). (In progress)

6) Tolerance for use of spaces in NetBIOS 
names. (To Do) [Ed: Why do people do this?] 

7) Password change capability from MS Windows
NT machines. (In progress).

8) The source code has been modularised far
more extensively, and the directory tree has
been modified so that like functionality resides
together. This has made the code more
comprehensible. This is an important factor in
bug recognition and in the maintainability of
the code. Andrew has made a Spartan effort
to realise this objective. (Done)

9) The source code has been extensively reviewed
and re-structured to ensure that it is
security clean. (Done).

Samba has grown in size, complexity and in
functionality. We do not aim to provide perfect
and complete functionality at first release
because that would delay release indefinitely.
Instead, as soon as the new code begins to
take stable shape and the rate of development
begins to settle down, Samba-2.0 will be released
for public alpha testing. We expect this
to happen around mid-September 1998.


Future Direction: 

The Samba-Team are not short of ideas and
objectives. Our users seem to think we need
some help and we like it that way. We also
like the nice gifts and alms that users deliver
our way by way of patches and feature enhancements.
It is totally re-assuring to see the
quality and the quantity of patches that are being
submitted. In the past year there have
been about 95 significant contributed patches,
nearly all of which have been integrated into
the Samba source code.

We consistently see evidence of the benefits
of the Open Source facility. Samba is distributed
under the GNU General Public Licence.
All source code is available to all users. This
results in a user base that is generally better
informed and better empowered to project
their own destiny. We believe this is the key
to the success of Samba.

If there are to be any wishes high on our 
needs table then the following will gain our 
green light treatment: 

a) There is a serious problem with implementation
of file locking. At this time it is not
possible to implement data consistency controls
between operating system level file access
(Unix users) and Samba based file access.
The current trade-off is one of performance
against integrity.

Kernel safe Oplock support for Unix is a
must!

There have been some ugly accidents when
Unix users started to access files that are subject
to locking inside Samba. The operating
system can have no knowledge of such incidents
at this time.

b) Samba is not yet compatible with the new
MS Windows NT style of NetBIOS-less SMB
over TCP/IP. It still lacks directory services
architecture support. This is a liability.

Ideally, we should provide support for Lightweight
Directory Access Protocol (LDAP),
Kerberos authentication, and the like. These
will be needed to replace the current browsing
and authentication schemes.

c) Completion of SMBLIB (the SMB Library) 
project. 

d) Integration of SMB capabilities into all 
standard Unix tools, eg: BASH, GNUtar, 
GNU git, Midnight Commander, and more. 

e) Implementation of a common configuration
file structure between Samba, NetAtalk and
Mars NetWare Emulator (Mars_NWE) to allow
use of a common web administration
tool.

f) Implementation of an SMB to NCP gateway
between Samba and Mars_NWE.

g) Secure Socket Layer support in other SMB 
clients (Samba already has it). 

h) Implementation of Unicode for all operations.

i) 64bit clean file system support. 

j) Implementation of a means to dynamically 
bridge MS Windows NT distributed account 
databases with the Unix password database. 

k) Full Access Control List (ACL) support for 
MS Windows NT clients. 

l) Complete DCE/RPC protocol support. 

These are but the first few development op¬ 
portunities that spring to mind. 

If anyone were to poll the Samba-Team about
their ultimate Samba dream they would find a
single answer, "Complete Interoperability
with all other SMB platforms so that no client
will know the difference". This was put another
way by one team member as, "To be to
MS Windows NT better than any MS Windows
NT Server can be!".




Samba Performance: 

In comparative but completely unofficial
benchmark tests Samba can readily be demonstrated
to be more scalable and to offer better
data throughput than any competing
SMB server technology.

The largest site we know of using Samba has
115,000 MS Windows PC clients that run off
a thousand or so Samba servers. The most
amazing thing about this company is the fact
that 48,000 MS Windows PC Clients are connected
to 16 front-end servers running Samba.
Three high-powered systems then provide the
file serving via NFS. The average client loading
over a 24 hour period is 452 PC clients
per server and a peak of 3,000 clients per
server. This is considerably more than any
competing SMB server can handle.
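
The peak figure quoted above is consistent with dividing the front-end client count evenly across the front-end servers. The following minimal Python check uses only the numbers stated in this section; it assumes that both the average and the peak refer to the 16 front-end servers, which the text does not state explicitly, and the variable names are ours.

```python
# Figures quoted in the text for the largest known Samba site.
front_end_clients = 48_000   # MS Windows PC clients on the front-end tier
front_end_servers = 16       # front-end servers running Samba
average_per_server = 452     # average clients per server over 24 hours

# If every client were connected at once, each front-end server would
# carry an equal share of the 48,000 clients.
peak_per_server = front_end_clients // front_end_servers

# Fraction of the peak load that the 24-hour average represents.
average_to_peak = average_per_server / peak_per_server

print(peak_per_server)
print(round(average_to_peak, 3))
```

Under that assumption the 24-hour average is about 15% of the peak load on each front-end server.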

Samba can deliver data at up to 22MBytes/sec 
on a platform equivalent to one on which MS 
Windows NT can not deliver more than 
14MBytes/sec. The nearest competing Unix 
based SMB server on the same Unix platform 
peaks at under 8MBytes/sec. 

Samba has uptimes and availability ratings
equivalent to the operating system uptime for
the system it runs on. The best recorded statistic
we know of is 320 days of continuous
use.

Samba can be upgraded while the system is
running and without need to interrupt any client.
This is a most important feature to some
organisations.

Samba is extensively used in Government departments,
Commerce, Banking and Finance,
as well as in not for profit organisations.

To list but a few users: 

First National Bank - USA 

Boeing Corporation - USA 

Suisse Credit Corp - Europe 


National Credit Insurance (Brokers) Pty Ltd - 
Australia 

SMC Pneumatics Pty Ltd - Australia 

We have active survey responses that cover
over 2000 sites world wide. These range in
size from 5 seats to 115,000 seats. There are a
surprising number of 10,000 plus seat users.


Support: 

Samba is supported by a global network of
independent support providers. There are over
100 organisations listed in the Support.txt file
that is bundled with every Samba source distribution
package (under the docs directory).

Moves are under way to provide an Official 
Samba CDRom distribution. 

John D Blair has written the current official
Samba guide. It is called "Samba: Integrating
Unix and Windows". Publisher is: Specialised
Systems Consultants, Inc. (SSC), ISBN:
1-57831-006-7. This book has been a best seller
since its release early in 1998.


Companies Supporting Samba: 

The Samba-Team wish to thank the following 
companies for their support: 

Silicon Graphics, USA - donated O200
server that houses the master Samba web site,
FTP site and mail site.

Digital Equipment Corp - donated AlphaStation
250 for support of Digital Unix and
OpenVMS users.

IBM - Austin, TX, USA - donated AIX System
for support of AIX customers.

Hewlett-Packard, USA - donated HPUX
system to improve level of HPUX functionality.




Sun Microsystems, USA - donated a Sun 
Sparc Station IPX for Samba development. 

NEC - UK - donated a set of PCs to aid 
Samba integration development. 

In addition, the Samba-Team would like to
render special thanks to those companies who
have extensively contributed to the development
of Samba by employing staff to work on
Samba development. These companies include:

Whistle Communications, Inc., USA 

Silicon Graphics, Inc. 

Aquasoft Pty Ltd, Australia. 

In the absence of financial support, a project 
as large and as complex as Samba could not 
exist. 


Summary: 

A few salient points emerge out of this paper. 

i) Samba has a large stable, growing user base 

ii) Samba is reliable 

iii) Samba is Open Standards compliant 

iv) Samba scales exceptionally well, as well 
as the platform it runs on 

v) Samba aids the growth and stability of the 
Unix market place 

vi) Samba puts the brakes on monopolistic
tendencies by offering users a seriously capable
alternative file and print server platform.

vii) Samba is the Open Source killer application
of choice!


1 Computer Science Dept., Australian National
University.

2 Cobalt Networks, Inc., USA.





The Domain Name System: 
Engineering vs Economics 

Dr Kate Lance 
System Manager 
connect.com.au Pty. Ltd. 
clance@connect.com.au 
September 1998 

Abstract 

Over the last few years much attention has focused on the Domain Name 
System, as its functionality, so essential to Internet integrity, seems to have 
shifted from engineering utility to controversial cash cow. The Internet evolved 
as a self-governing community: its transition to a self-regulatory industry in 
Australia will depend upon whether “industry” means restricted to supply- 
side interests only, or whether it can evolve to mean something as inclusive 
and diverse as the community itself. This paper traces some of the turbulent 
history of Internet governance in Australia. 


1. An Engineering Problem 

Computers, routers and other Internet devices have names for human convenience
but they pass traffic between each other using numerical addresses. In the late 60’s,
70’s and early 80’s, when the Internet was basically the collaborative US government
and research network ARPANET, the mapping of names to addresses was via a file,
HOSTS.TXT, updated by email and collected using FTP every few days. As the
networks began to grow, this method caused name duplicates, inconsistencies, load
and timeouts on the master machines and was clearly not scalable.

To help resolve these difficulties Paul Mockapetris wrote RFCs 882 and 883 (Nov 
1983) and RFCs 1034 and 1035 (Nov 1987) describing a new Domain Name System 
to automatically map between machine names and numbers. 

Machine names are hierarchical, e.g. yalumba.connect.com.au has the levels: 

    yalumba     the local name
    .connect    the company
    .com        the commercial sector
    .au         Australia, the “top-level” for this domain name

.au is our country-code (CC). There are two-letter abbreviations for all countries.
Other top-level domains (TLDs) are not countries but functional or organisational
categories: .com, .edu, .gov, .mil, .net, .org and .int. These names are fairly
arbitrary: Australian sites do not necessarily have to use .au names and commercial
sites do not necessarily need .com or .com.au.

2. A Hierarchy of Responsibility 





DNS files on well-known computers define who is responsible for authoritative 
(correct, trustworthy) information for the level one step below. For instance, the 
primary source for .au describes the locations of authoritative information about 
its second-level domains: com.au, net.au, gov.au, edu.au, org.au, asn.au, 
csiro.au, etc. The computers for (say) org.au define who is responsible for its 
third-level domains, such as isoc-au.org.au. 
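
The chain of responsibility described above can be sketched in a few lines of Python. This is an illustration only: the function name delegation_chain is our own, and it assumes a delegation at every label boundary, whereas a real resolver follows the NS records published by each zone and a zone may cover more than one level.

```python
def delegation_chain(name):
    """Return the domains consulted, from the top level down, for a name."""
    labels = name.rstrip(".").lower().split(".")
    # Build suffixes from the rightmost label outward: au, org.au, ...
    return [".".join(labels[i:]) for i in range(len(labels) - 1, -1, -1)]


chain = delegation_chain("isoc-au.org.au")

# Each domain in the chain points to the authoritative source for the next.
for parent, child in zip(chain, chain[1:]):
    print(f"{parent} delegates {child}")
```

For isoc-au.org.au the chain runs from au to org.au to isoc-au.org.au, matching the example in the text.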

Responsibility for the running of a domain is “delegated” to each domain below 
it. Acceptance of that delegated responsibility is also acceptance of the obligation 
to maintain the policy and logical consistency of the domain. This involves a “duty 
of care” in two areas: rigorous, accurate, technical administration and thoughtful 
compliance with the intent of the domain functionality. 

The top of the tree of responsibility is IANA, the Internet Assigned Numbers 
Authority, run by Jon Postel. He delegated country-code domains in the early days 
of the Internet to network-knowledgeable individuals, as trustees for their country 
domains. Such work was fairly tedious and routine, but it was essential that it be 
done, and done carefully, for the viability of the whole Internet. The country code 
.au was delegated in 1984 to network pioneer Robert Elz, from Melbourne University. 

3. The First Links 

In Australia the first general-use computer network was ACSnet, which used 
MHSnet software to pass email, ftp and Usenet News between Computer Science 
departments and research organisations like CSIRO. 

In September 1988 a report for the Australian Vice-Chancellors Committee
(AVCC) and other academic and research bodies recommended the establishment
of a national network. In March 1989, Geoff Huston from ANU was appointed
Technical Manager of the fledgling AARNet project.

The design was simple and radical: multiprotocol routers in state hubs connected 
by leased lines (initially 48kbps) to the national hub in Canberra, thence to the 
international link. Individual sites were responsible for the management of their own 
connections to the AARNet routers. 

In a mere three weeks in May 1990, Huston and his deputy Peter Elford travelled 
all over Australia turning on the hubs while staff at the end-sites worked frantically 
to connect to the new network. AARNet went live. Huston became responsible for 
the second-level domains edu.au and gov.au. Elz continued to manage .au, net.au, 
org.au and the ACSNet domain, oz.au. 

AARNet permitted access to the first commercial Australian Internet service 
provider in 1992. This was connect.com.au Pty. Ltd., started by network engineer 
Hugh Irvine with Joanne Davis and Ben Golding. The domain net.au was delegated 
to Irvine in 1994, and was administered free of charge by Connect for some years as 
a service to the Internet community. 

In August 1995 Michael Malone, from the Perth ISP iiNet, was delegated
responsibility for the new second-level domain asn.au for associations and non-profit
groups, also administered free of charge.


4. The Other Address Hierarchy 

Domain names map to IP addresses, but how are IP addresses managed? Until the 
early 90’s the InterNIC (funded by US taxpayers) would allocate blocks of numerical 
addresses to be used to uniquely identify network hosts within a domain. In 1993 the 
National Science Foundation sought to devolve this function to regional, self-funded 
IP address registries, operating in accordance with RFC 1466. 

In September 1993 Geoff Huston applied to IANA for a large block of addresses 
“on behalf of the Australian network community”, with the ultimate goal of seeing 
a national IP address registry set up, both for regional autonomy and for efficiency 
(allocations from the US were taking weeks). It was to be “a totally independent 
entity, which operates within the broad structure of a not-for-profit service operation, 
and applies a single community policy in an open and fair manner”. 

The address space, in older terminology, was equivalent to 64 Class-B blocks, 
over 4 million individual host addresses; an enormous amount at a time that large 
allocations were becoming increasingly rare due to the potential exhaustion of the IP 
address space. 
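
The figure of "over 4 million" follows directly from the size of a Class-B block. A short Python check, using the standard ipaddress module: the network 128.0.0.0/10 below is an arbitrary placeholder chosen only because it spans exactly 64 Class-B sized networks; it is not the block that was actually allocated.

```python
import ipaddress

CLASS_B_PREFIX = 16   # a Class-B network has a 16-bit network part
blocks = 64           # number of Class-B equivalents quoted in the text

# Each Class-B block leaves 32 - 16 = 16 bits for host addresses.
addresses_per_block = 2 ** (32 - CLASS_B_PREFIX)
total_addresses = blocks * addresses_per_block

# In CIDR terms, 64 contiguous /16 networks aggregate into a single /10.
# Placeholder network for illustration, not the real allocation.
example = ipaddress.ip_network("128.0.0.0/10")
class_b_count = len(list(example.subnets(new_prefix=CLASS_B_PREFIX)))

print(total_addresses)
print(example.num_addresses)
print(class_b_count)
```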

By mid-1994 Huston, Hugh Irvine and AARNet engineer Andy Linton were
considering how such a registry could be set up. They decided the fastest way would
be to set up a shell company, then bring on additional directors from the Australian
network community to design the structure, articles and administrative details in
consultation with the community. It was hoped that this would keep the address
block free of potential ownership by AARNet, the AVCC or any other body (Telstra
had already been suggested as a future AARNet owner).

This was to be the Australian Internet Registry, AIR: initially for IP address
administration but, if successful, a potential future home for Australian domain name
administration. It was an engineering solution to what appeared to be an engineering
problem.

5. AARNet Blues 

1994 had been a year of turmoil in the self-awareness of the Australian Internet 
community. Keep in mind that the first widely-deployed Web browser, the elegant 
Mosaic, had appeared only in 1993 and was still relatively unknown at this stage— 
Web sites numbered only in the tens to hundreds, while Internet nodes, using text- 
based email, telnet, ftp and Usenet News, numbered in the millions. 

But even text-based communication was growing exponentially and the pressure
for network expansion was never-ending. The AVCC had previously refused to accept
a corporatised business plan that could have subsidised AARNet expansion, so they
had limited funding for AARNet and no strategy at all to deal with its popularity.

Late in 1993 substantial funds had been promised by the government to upgrade
AARNet. Major forces such as Telecom (Telstra) and IBM submitted proposals
that had only indirect association with AARNet but succeeded in diverting a large
percentage of the funds to projects that appeared to be extensions of their own
corporate research. The AVCC were outmanoeuvred and AARNet, under mounting
pressure, was denied desperately-needed resources.

In 1994 AARNet access was opened up further to VARs (service providers)
under a volume (per-MByte) charging scheme, and it was proposed that from 1995
full volume-charging for universities and research institutes would also begin. The
prospect of volume-charging to recover costs was violently argued, especially in the
newsgroup aus.net.aarnet, which had previously been a fairly civilised source of
information exchange.

6. AIR Goes Up in Flames 

It was in this edgy atmosphere in late September 1994 that a journalist, who 
had been passed highly misleading information about the AIR scheme, published a 
story that “control of Internet addressing and name allocation has been privatised, 
which means Australian Internet users may soon have to ‘lease’ their addresses for 
an annual fee. The Australian Internet Registry (AIR) has ‘assumed responsibility’ 
for all Internet protocol (IP) address and domain name allocation... ” 

It didn’t matter that domain names had nothing to do with AIR: speculation
abounded that it would “sell domain names” and that in future names and addresses
would be charged for, possibly at one dollar (!) each. The AIR directors released a
press statement explaining the true situation, that the scheme was for public
governance and that fees had not been proposed, but it was too late.

The newsgroup exploded with wild accusations and vicious personal attacks. The 
accuracy of the report was not questioned. The AIR directors were on holiday or did 
not read News, so were unaware of the turmoil until some days after it began. 

Robert Elz wrote a stinging rebuke to the madness, and the newsgroup settled 
down... but the damage was done. The community had shown that the ideals of 
trust and cooperation, so essential to the engineering that powered the Internet, had 
collapsed before the social phenomena of rampant self-interest, paranoia and pack 
hysteria. 

7. The Lost Opportunity 

Under extreme pressure from the AVCC the directors withdrew their proposal.
The AIR experiment ended and an extraordinary opportunity was lost for
self-governance. Personal relationships were fractured and the Australian Internet
community had inflicted a terrible injury on itself, one that has taken years to even begin
to heal.

AARNet was an embarrassment to the AVCC. They had neither the funds, the
imagination nor the will to run the Australian backbone. Talks escalated with Telstra
in late 1994. The prospect of AARNet going to the despised telco generated another
eruption of furious argument on the lists and newsgroups (some AARNet staff left
at the time, in part because of this issue) but in May 1995, AARNet management
was transferred to Telstra. It was the end of an incredible era for Australia that had
begun almost exactly five years before.

The address block issued by the InterNIC on behalf of the Australian network
community was claimed by the AVCC and passed on to Telstra. Because no independent
registry existed, no sensible or transparent allocation policy for provider blocks
was defined. As a result, owners of addresses from this block who are not Telstra
customers (other than a few large ISPs) may encounter future problems with address
portability and global routability. Telstra allocated to itself around one quarter of
the address block and the non-Telstra portion was exhausted by February 1997.

APNIC (the Asia-Pacific Network Information Centre) started as a pilot project in 
late 1993 based on volunteer labour and donated facilities from a number of countries. 
It evolved into the independent IP address registry so desperately needed by the Asia- 
Pacific region. APNIC has recently moved its headquarters to Australia but retains 
a strong regional focus. 

8. Governance Goes Commercial 

From 1983 the National Science Foundation, NSF, had the responsibility of
managing the non-military part of the Internet infrastructure in the US. In 1985 NSF
and its contractors began the development of a backbone called NSFNET, while
ARPANET was gradually phased out. NSF set up an agreement with Network
Solutions Inc (NSI) from January 1993 which permitted it to perform registration services
on behalf of NSF.

From October 1995 NSI charged for domain name registration, US$100 (plus $50 
per year ongoing) of which 30% was reserved for an as-yet unallocated “Intellectual 
Infrastructure Fund” (in early 1998 this had to be retrospectively legalised because 
it was equivalent to imposition of a tax). All over the world lists and newsgroups 
went ballistic but, as the company introduced better administrative practices after 
a very bumpy start, the outrage died down and people started to accept that name 
and address governance of the Internet had gone commercial. 

In fact, people started to notice that not only had it gone commercial but also
that, for NSI at least, it had become a substantial source of income. What had been
a cooperative engineering facility became in many people’s eyes a new cash cow: and
what was worse, they weren’t getting their share!


9. Meanwhile Back at the Ranch 

By late 1995 the administration of com.au domain names was under enormous
pressure as business finally woke up to the potential of the Internet. Robert Elz was
still personally handling it and the backlog and delays were increasing. He also took
a holiday at this time and the temporary replacement permitted many domain names
through that should have been rejected under the policy rules.





This resulted in an inconsistency that has led to serious disagreements since then,
as it naturally seems unfair to later applicants whose similarly variant names are
rejected. This was a fundamental lesson in the importance of consistency, the dangers of
badly thought-out policy variations and their often horribly long-term consequences.

Three years ago, at the AUUG’95 conference, a small group of people decided to 
start a mailing-list dedicated to discussion of Australian Internet matters, especially 
name and address governance and the setting up of an Australian version of ISOC, 
the Internet Society. The list was called inet-issues and, though it is almost defunct 
today, it had an interesting minor role in restarting public communication on these 
problems which, due to the traumas of the previous 12 months, had essentially ceased. 

Slowly, painfully, some public discussion on the issues restarted. Most contributors 
were now strongly aware of the dark side of mailing-list communication. It is one 
of the most successful and inclusive forms of public discussion yet devised, but it is 
also vulnerable to serious abuse by people who choose not to follow the rules of sane 
social discourse. 

In June 1996 Hugh Irvine sent a “Call for Formation” to the inet-issues
mailing-list, stating that, in response to the widely-expressed need for an Australian
Internet organisation, he’d commenced legal formalities to set up an Internet Society of
Australia as a company limited by guarantee and operating as a not for profit
organisation. The directors, objects, scope and format of the society were matters for the
Australian Internet community to establish.

Much debate occurred over the next five or six months, culminating in an Inaugural 
General Meeting on 27th November 1996. The Society had around 350 founding 
members and enormous support from the Australian Internet community overall. Its 
Objects clearly positioned it as an organisation dedicated to the end-users of the 
networks and the public-good obligations of Internet governance bodies. 

Around the same time, as a response to the enormous workload of com.au administration, 
Robert Elz gave a non-exclusive 5-year licence to Melbourne-IT, a commercial 
offshoot of Melbourne University (which had freely subsidised the cost of Elz’s 
labours on behalf of the Internet community for many years) to do the actual 
administration of com.au. Melbourne-IT started charging for domain name registrations 
in November 1996, at $125-$150 a year. 

At Connect, demand for net.au names exploded as they were seen as free alternatives 
to com.au names. After net.au administration almost collapsed under the 
demand, Connect brought in charging at the same level as Melbourne-IT to prevent 
the land-grab on net.au names. 

10. Enter ADNA 

Melbourne-IT had initially planned to remove the registration of pre-existing 
com.au names whose owners had not paid them fees by mid-March 1997. A class 
action was brought by ISP iiNet on behalf of com.au owners to prevent this. ISOC-AU 
suggested changes to the timetable which assisted in the resolution of the case, 
and the period of grace for existing owners was extended until November 1997. 

Around the same time, moves to establish a domain name administration body 
had begun, driven by the Internet Industry Association of Australia (then called INTIAA, 
now part of IIA). Because of the court case the delegates for most domains 
were unable to be involved. Despite concern expressed by individuals, ISOC-AU, the 
AVCC and CSIRO about the haste of formation, lack of consultation and shortcomings 
of the Articles, the company ADNA (Australian Domain Name Administration) 
was incorporated in May 1997. 

ADNA’s view was that the scope of its governance included .au and all of the 
SLDs, but a widely-expressed reservation was that neither .au nor the smaller SLDs 
were the problem: in most eyes other than ADNA’s the main problem was that 
com.au was not open to competition. 

The AVCC and CSIRO refused to recognise any authority of the new body over the 
domains they used and the delegates for net.au (Irvine) and for edu.au and gov.au 
(Huston) also refused to be involved. The delegate for org.au and .au itself (Elz) 
remained outside of the discussion. Only asn.au (Malone) and com.au (Melbourne-IT) 
joined the organisation, as registrars. ADNA had only eight members: six industry 
associations and two registrars. (One association has now resigned.) 

ISOC-AU also did not join, but directors attended Board meetings as observers 
and volunteered to help amend the organisation’s Articles to address community 
concerns. Changes were proposed to make a public-benefit role explicit, to clearly 
separate policy from operations, and to include all of the current delegates on the 
board to try to ensure a smooth transition to a new scheme of governance. 

Over this time disagreements arose between the ADNA directors themselves regarding 
details of the minutes; legal action between directors was threatened and 
some of the directors either resigned or simply stopped attending meetings. 

By early 1998 ADNA was no closer to gaining the support of the Internet community 
than when it had begun. It had withdrawn from discussion with ISOC-AU. It 
had passed several highly controversial motions at meetings with only a few directors 
in attendance, but had no authority to put them into action anyway. 

The only positive was that Malone had released a first version of shared registry 
software, a technical necessity for multiple DNS administration. But the opening up 
of the lucrative com.au domain to competition, which had been confidently expected 
to happen by November 1997, was still only a remote possibility (and has not occurred 
by September 1998). 

11. The Memorandum of Understanding 

The agreement between NSF and NSI concludes officially at the end of September 
1998. Planning for what will come next has consumed the efforts of many of the most 
productive and far-sighted people in the world-wide Internet community over the last 
few years. 





In May 1996 Jon Postel opened discussion on the possibility of new multiple top-level 
domain name registries (those that deal with generic domains like .com, not 
the country-code ones). After much discussion, the International Ad Hoc Committee 
was set up in September 1996 to organise public input into a report to suggest new 
procedures for international DNS management. 

On February 28 1997 the IAHC released the gTLD-MoU, the generic Top Level 
Domain Memorandum of Understanding, which requested public support for a set of 
proposed international DNS bodies: a Policy Oversight Committee (POC), a Policy 
Advisory Body (PAB) and a Council of Registrars (CORE). Over 220 international 
organisations to date have signed their agreement to the gTLD-MoU, including ISOC-AU 
and five other Australian Internet organisations. 

The gTLD-MoU proposed seven new gTLDs and a not-for-profit, public-good top-level 
registry working in association with an openly competitive system of registrars 
(CORE members) rather than the anti-competitive combination of the two functions, 
as at NSI. 

The CORE members started building a centralised shared registry software system. 
Most invested reasonably large sums of money setting up businesses to handle 
registry traffic when the system was due to start operations in March 1998. 

At this stage NSI’s income from DNS was around four million US dollars per 
month. The company began a campaign of high-level lobbying of US legislators in 
late 1997—most of whom had never heard of the DNS—with the startling suggestion 
that the IAHC was a Swiss conspiracy to take over the Internet, and that Jon Postel 
was “double-dealing” the US government. 

12. Green Paper, White Paper... 

The US government responded. It appointed Ira Magaziner to produce a draft 
Green Paper called “A Proposal to Improve Technical Management of Names and 
Addresses” which appeared on January 30 1998. The cooperative gTLD-MoU effort 
ground to a halt. 

The Green Paper borrowed a number of ideas, without attribution, from the 
gTLD-MoU, but a major area of disagreement was the proposal that the top-level 
registries be open to competition, which basically indicated a lack of understanding of 
their function or of their vulnerability to abuse. International discussion was invited, 
and it certainly took place. Comments flooded in from individuals, small businesses, 
international Internet groups, legal organisations, telecommunications bodies and 
sovereign governments. 

The flaws in the Green Paper were fairly obvious to most of the Internet community. 
In Australia, even the quarreling factions around ADNA were able to get 
together in March 1998 and, under the auspices of NOIE (the National Office for 
the Information Economy, charged with providing policy advice on IT to the government), 
were able to agree on a common set of criticisms of the plan, which appeared 
in the Australian government’s official response to the US government. 





On June 5 1998 the US government replied with the White Paper “Management 
of Internet Names and Addresses”. This appeared to take into account many of the 
concerns of the international community. Further meetings for consultation, called 
the “International Forum on the White Paper”, have been held in the US in early July, 
Europe (Geneva) in late July, and the Asia-Pacific region (Singapore) in mid-August. 

The consensus seems to be that a new IANA should be set up as a “not-for-profit, 
cost-recovery, nonpartisan corporation for charitable and public purposes, dedicated 
to preserving the operational stability of the central coordinating functions of the 
global Internet for the public good”. It would coordinate assignment of Internet 
technical parameters, manage the coordination of Internet address space, manage 
the coordination of the Internet domain name system and oversee operation of the 
authoritative Internet root server system. 

Discussion to bring about broad-based support on the structure, international representation 
and responsibilities of the new organisation is currently active. The hope 
is to have something ready to take on the function of NSI by the end of September 
1998. 

13. What a Long, Strange Trip it’s Been 

... from cooperative engineering utility to the frenzied focus of world-wide governments; 
from boring administrivia to commercial power-grab. 

In Australia we still don’t have a functional organisation to handle our .au top-level 
domain and its hierarchy. The system holds together because of the inherent 
engineering strengths of the DNS and because of the trustees, the delegates who have 
put in years of effort maintaining one of the essential elements of Australian Internet 
connectivity. 

We have seen that several times in the brief history of the Internet in Australia 
efforts to set up organisations for Internet governance have failed, usually because of 
various combinations of fear, self-interest, ignorance, mistrust and inertia. 

We have one last chance to bring it about. Meetings and discussions are continuing, 
some with the guidance of NOIE as a source of expertise on industry self-regulation. 
We must achieve consensus on: 

• a broad-based definition of stakeholders 

• clearly-defined objectives and principles 

• inclusive, transparent and accountable processes 

• dispute resolution procedures 

• compliance mechanisms 

• funding and resources 

• a possible legislative framework 





As an emerging Internet community we have (often) worked well together under 
a regime of self-governance—rough consensus and working code. But now, as a new 
industry, we have to show we are mature enough to cooperatively bring about self-regulation 
in domain name governance. No-one else is going to do it for us. 

How do we make the transition from a community to an industry? Industry so 
often means simply the supply-side of the transaction: the demand-side is assumed to 
be no more than a passive consumer. But the Internet shatters those assumptions: it 
grew out of the grass-roots cooperation of its users, and its functional infrastructure, 
such as the DNS, clearly reflects that heritage. 

Supply and demand are not opposite sides of the fence on the Internet: service 
providers are also customers; users can provide content as compelling as that of any 
large business; independent volunteers write much of the code that makes it all work. 
The complex interdependencies of the world-wide mesh of networks mean that no-one 
is the supplier, no-one is the consumer, no-one runs it and everyone runs it. 

A few selfless volunteers helped keep it going because they appreciated its value 
in the broadest possible sense—long before anyone else did. Now the world sees that 
value and is rushing to turn it into cash. While the Internet is famously resilient 
to damage it still has fundamental vulnerabilities, such as its dependence on the 
integrity of the DNS at every level. Only time will tell if the precision engineering of 
Internet connectivity will survive the gold-rush. 





Portions of text in this paper have been extracted from Quality of Service: Delivering QoS on the 
Internet and in Corporate Networks, by Paul Ferguson and Geoff Huston, published by John Wiley 
& Sons, January 1998, ISBN 0-471-24358-2. 


Quality of Service on the Internet: Fact, 
Fiction, or Compromise? 

Paul FERGUSON <ferguson@cisco.com> 

Cisco Systems, Inc. 

USA 

Geoff HUSTON <gih@telstra.net> 

Telstra Internet 
Australia 

Abstract 

The Internet has historically offered a single level of service, that of "best effort," where all data 
packets are treated with equity in the network. However, we are finding that the Internet itself does 
not offer a single level of service quality, and some areas of the network exhibit high levels of 
congestion and consequently poor quality, while other areas display consistent levels of high 
quality service. Customers are now voicing a requirement to define a consistent service quality they 
wish to be provided, and network service providers are seeking ways in which to implement such a 
requirement. This effort is happening under the umbrella called "Quality of Service" (QoS). Of 
course, this phrase has now become overused, often in vague, nondefinitive 
references. QoS discussions currently embrace abstract concepts and varying ideologies, and 
lack a unified definition of what QoS actually is and how it might be implemented. Consequently, 
expectations of how QoS technologies might realistically be deployed on a global scale have not 
been appropriately managed within the Internet community at large. A more important 
question is whether ubiquitous end-to-end QoS is even realistic in the Internet, given that 
the decentralized nature of the Internet does not lend itself to homogeneous mechanisms to 
differentiate traffic. This paper examines the various methods of delivering QoS in the Internet, and 
attempts to provide an objective overview on whether QoS in the Internet is fact, fiction, or a matter 
of compromise. 

Contents 

• 1 Introduction 

• 2 QoS and network engineering, design, and architecture 

• 3 QoS tools 

  ○ 3.1 Physical layer mechanics 

    ■ 3.1.1 Alternate physical paths 

  ○ 3.2 Link layer mechanics 

    ■ 3.2.1 ATM 

    ■ 3.2.2 Frame relay 

    ■ 3.2.3 IEEE 802.1p 

  ○ 3.3 Network and transport layer mechanics 

    ■ 3.3.1 TCP congestion avoidance 

    ■ 3.3.2 Preferential congestion avoidance at intermediate nodes 

    ■ 3.3.3 Scheduling algorithms to implement differentiated service 

    ■ 3.3.4 Traffic shaping non-adaptive flows 

  ○ 3.4 Integrated services and RSVP 

• 4.0 Conclusions 

• 5.0 References 

• Acknowledgments 

1 Introduction 

It is hard to dismiss the entrepreneurial nature of the Internet today - this is no longer a research 
project. For most organizations connected to the global Internet, it’s a full-fledged business interest. 
Having said that, it is equally hard to dismiss the poor service quality that is frequently experienced 
- the rapid growth of the Internet, and increasing levels of traffic, make it difficult for Internet 
users to enjoy consistent and predictable end-to-end levels of service quality. 

What causes poor service quality within the Internet? The glib and rather uninformative response is 
"localized instances of substandard network engineering which is incapable of carrying high traffic 
loads." 

Perhaps the more appropriate question is "what are the components of service quality and how can 
they be measured?" Service quality in the Internet can be expressed as the combination of 
network-imposed delay, jitter, bandwidth and reliability. 

Delay is the elapsed time for a packet to be passed from the sender, through the network, to the 
receiver. The higher the delay, the greater the stress that is placed on the transport protocol to 
operate efficiently. For the TCP protocol, higher levels of delay imply greater amounts of data held 
"in transit" in the network, which in turn places stress on the counters and timers associated with the 
protocol. It should also be noted that TCP is a "self-clocking" protocol, where the sender’s 
transmission rate is dynamically adjusted to the flow of signal information coming back from the 
receiver, via the reverse direction acknowledgments (ACKs), which notify the sender of successful 
reception. The greater the delay between sender and receiver, the more insensitive the feedback 
loop becomes, and therefore the protocol becomes more insensitive to short term dynamic changes 
in network load. For interactive voice and video applications, the introduction of delay causes the 
system to appear unresponsive. 

Jitter is the variation in end-to-end transit delay (in mathematical terms it is measurable as the 
absolute value of the first differential of the sequence of individual delay measurements). High 
levels of jitter cause the TCP protocol to make very conservative estimates of round trip time 
(RTT), causing the protocol to operate inefficiently when it reverts to timeouts to reestablish a data 
flow. High levels of jitter in UDP-based applications are unacceptable in situations where the 
application is real-time based, such as an audio or video signal. In such cases, jitter causes the 
signal to be distorted, which in turn can only be rectified by increasing the receiver’s reassembly 
playback queue, which affects the delay of the signal, making interactive sessions very cumbersome 
to maintain. 
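
The definition of jitter given above can be made concrete with a few lines of code. The following is a minimal sketch, assuming delay samples are available as a list of per-packet transit delays in milliseconds; the function names are illustrative and do not come from any measurement tool.

```python
def jitter_series(delays_ms):
    """Absolute first difference of consecutive delay samples."""
    return [abs(later - earlier)
            for earlier, later in zip(delays_ms, delays_ms[1:])]


def mean_jitter(delays_ms):
    """Average jitter over the sample sequence (0.0 if fewer than two samples)."""
    diffs = jitter_series(delays_ms)
    return sum(diffs) / len(diffs) if diffs else 0.0


# Four delay samples: the delay rises by 20 ms, falls by 10 ms, then holds.
samples = [100, 120, 110, 110]
print(jitter_series(samples))  # [20, 10, 0]
print(mean_jitter(samples))    # 10.0
```

A path with a high but constant delay therefore shows zero jitter under this measure; it is the variation between successive packets, not the delay itself, that is captured.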

Bandwidth is the maximal data transfer rate that can be sustained between two end points. It should 
be noted that this is limited not only by the physical infrastructure of the traffic path within the 
transit networks, which provides an upper bound to available bandwidth, but is also limited by the 
number of other flows which share common components of this selected end-to-end path. 

Reliability is commonly conceived of as a property of the transmission system, and in this context, 
it can be thought of as the average error rate of the medium. Reliability can also be a byproduct of 
the switching system, in that a poorly configured or poorly performing switching system can alter 
the order of packets in transit, delivering packets to the receiver in a different order than that of the 
original transmission by the sender, or even dropping packets through transient routing loops. 
Unreliable or error-prone network transit paths can also cause retransmission of the lost packets. 
TCP cannot distinguish between loss due to packet corruption and loss due to congestion, and 
packet loss invokes the same congestion avoidance behavior response from the sender, causing the 
sender’s transmit rates to be reduced by invoking congestion avoidance algorithms even though no 
congestion may have been experienced by the network. In the case of UDP-based voice and video 
applications, unreliability causes induced distortion in the original analog signal at the receiver’s 
end. 
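
The sender's reaction to loss can be illustrated with a deliberately simplified model. The sketch below assumes an additive-increase, multiplicative-decrease rule applied once per round trip (the window grows by one segment when no loss is seen and is halved when loss is seen). Real TCP implementations add slow start, fast retransmit and other details that are omitted here, and the function name and defaults are hypothetical.

```python
def aimd_window(loss_events, initial_cwnd=10, min_cwnd=1):
    """Track a congestion window (in segments) over a sequence of round trips.

    loss_events: iterable of booleans, True if loss was detected in that round trip.
    The model does not know why the packet was lost: corruption on a noisy
    link shrinks the window exactly as congestion would.
    """
    cwnd = initial_cwnd
    history = []
    for lost in loss_events:
        if lost:
            cwnd = max(min_cwnd, cwnd // 2)  # multiplicative decrease
        else:
            cwnd += 1                        # additive increase
        history.append(cwnd)
    return history


# One loss in the third round trip, whatever its cause, halves the window.
print(aimd_window([False, False, True, False]))  # [11, 12, 6, 7]
```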

Accordingly, when we refer to differentiated service quality, we are referring to differentiation of 
one or more of these four basic quality metrics for a particular category of traffic. 

Given that we can define some basic parameters of service quality, the next question is how service 
quality is implemented within the Internet. 

The Internet is composed of a collection of routers and transmission links. Routers receive an 
incoming packet, determine the next hop interface, and place the packet on the output queue for the 
selected interface. Transmission links have characteristics of delay, bandwidth and reliability. Poor 
service quality is typically encountered when the level of traffic selecting a particular hop exceeds 
the transmission bandwidth of the hop for an extended period of time. In such cases, the router’s 
output queues associated with the saturated transmission hop begin to fill, causing additional transit 
delay (increased jitter and delay), until the point is reached where the queue is filled, and the router 
is then forced to discard packets (reduced reliability). This in turn forces adaptive flows to reduce 
their sending rate to minimize congestion loss, reducing the available bandwidth for the application. 
Poor service quality can be generated in other ways, as well. Instability in the routing protocols may 
cause the routers to rapidly alter their selection of the best next hop interface, causing traffic within 
an end-to-end flow to take divergent paths, which in turn will induce significant levels of jitter, and 
an increased probability of out-of-order packet delivery (reduced reliability). 
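
The progression from queue growth to packet discard can be seen in a toy discrete-time model. This is a sketch only, assuming a single output queue with a fixed packet limit, a fixed number of packets transmitted per tick, and tail drop when the queue is full; the names and parameters are illustrative and are not taken from any router implementation.

```python
def simulate_output_queue(arrivals, service_per_tick, queue_limit):
    """Simulate one router output queue with tail drop.

    arrivals:         packets arriving in each tick
    service_per_tick: packets the link can transmit per tick
    queue_limit:      maximum packets the queue can hold
    Returns (queue depth after each tick, packets dropped in each tick).
    """
    depth = 0
    depths, drops = [], []
    for arriving in arrivals:
        depth = max(0, depth - service_per_tick)  # transmit from the queue
        depth += arriving                         # enqueue new arrivals
        dropped = max(0, depth - queue_limit)     # tail drop on overflow
        depth -= dropped
        depths.append(depth)
        drops.append(dropped)
    return depths, drops


# Offered load of 5 packets/tick against a link that carries 3 packets/tick.
depths, drops = simulate_output_queue([5, 5, 5, 5, 1], service_per_tick=3, queue_limit=6)
print(depths)  # [5, 6, 6, 6, 4]
print(drops)   # [0, 1, 2, 2, 0]
```

In this model the queue depth (and hence the queuing delay, which is roughly the depth divided by the service rate) rises first, and discards begin only once the queue limit is reached.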

Accordingly, when we refer to the quality of a service, we are looking at these four metrics as the 
base parameters of quality, and it must be noted that there are a variety of network events which can 
affect these parameter values. 

Also, it should be noted that in attempting to take a uniform "best effort" network service 
environment and introduce structures which allow some form of service differentiation, the tools 
which allow such service environments to be constructed are configurations within the network’s 
routers designed to implement one or more of the following: 

• Signal the lower level transmission links to use different transmission servicing criteria for 
particular service profiles, 

• Alter the next hop selection algorithm to select a next hop which matches the desired service 
levels, 

• Alter the router’s queuing delay and packet discard algorithms such that packets are 
scheduled to receive transmission resources in accordance with their relative service level, 
and 





• Alter the characteristics of the traffic flow as it enters the network to conform to a contracted 
profile and associated service level. 

The "art" of implementing an effective QoS environment is to use these tools in a way which can 
construct robust differentiated service environments. 
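
As an illustration of the last of these tools, conforming traffic to a contracted profile at the network edge, the sketch below uses a token bucket. The token bucket is one common way of expressing such a profile and is our choice for the example, not something the list above prescribes; the class name, the rate and burst parameters, and the use of caller-supplied timestamps are all assumptions.

```python
class TokenBucket:
    """Check packets against a profile of sustained rate plus permitted burst."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s   # contracted sustained rate
        self.burst = burst_bytes       # bucket depth (maximum burst)
        self.tokens = burst_bytes      # bucket starts full
        self.last = 0.0                # time of the previous check, in seconds

    def conforms(self, now, packet_bytes):
        """Return True if the packet fits the profile at time `now`.

        Timestamps are assumed to be non-decreasing.
        """
        elapsed = now - self.last
        self.last = now
        # Refill at the contracted rate, never beyond the bucket depth.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False                   # out of profile: delay, mark or discard


bucket = TokenBucket(rate_bytes_per_s=1000, burst_bytes=1500)
print(bucket.conforms(0.0, 1500))  # True: the initial burst is allowed
print(bucket.conforms(0.5, 1000))  # False: only 500 bytes of credit so far
print(bucket.conforms(1.0, 1000))  # True: credit has refilled to 1000 bytes
```

What is done with a non-conforming packet (queuing it until credit is available, marking it, or discarding it) is a separate policy decision that the sketch leaves to the caller.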

From this perspective, the concept of service quality is important to understand, as opposed to what 
most people call quality of service, or QoS. Service quality can be defined as delivering consistently 
predictable service, to include high network reliability, low delay, low jitter, and high availability. 
QoS, on the other hand, can be interpreted as a method to provide preferential treatment to some 
arbitrary amount of network traffic, as opposed to all traffic being treated as "best effort," and in 
providing such preferential treatment, attempting to increase the quality level of one or more of 
these basic metrics for this particular category of traffic. There are several tools available to provide 
this differentiation, ranging from preferential queuing disciplines to bandwidth reservation 
protocols, from ATM-layer congestion and bandwidth allocation mechanisms to traffic-shaping, 
each of which may be appropriate dependent on what problem is being solved. We do not see QoS 
as being principally concerned about attempting to deliver "guaranteed" levels of service to 
individual traffic flows within the Internet. While such network mechanisms may have a place 
within smaller network environments, the sheer size of today’s Internet effectively precludes any 
QoS approach which attempts to reliably segment the network on a flow-by-flow basis. The major 
technology force which has driven the explosive growth of the Internet as a communications 
medium is the use of stateless switching systems which provide variable best effort service levels to 
intelligent peripheral devices. Recent experience has indicated that this approach has extraordinary 
scaling properties: stateless switching architectures can scale easily to speeds of gigabits 
per second while preserving functionality, and the unit cost of stateless switching has 
decreased at a rate which is close to the basic scaling rate. 

We also suggest that if a network cannot provide a reasonable level of service quality, then 
attempting to provide some method of differentiated QoS on the same infrastructure is virtually 
impossible. This is where traditional engineering, design, and network architectural principles play 
a significant role. 

2 QoS and network engineering, design, and architecture 

Before examining methods to introduce QoS into the Internet, it is necessary to examine methods 
of constructing a network that exemplify sound network engineering principles — scalability, 
stability, availability, and predictability. 

Typically, this exercise entails a conservative approach in designing and operating the network, 
undertaking measures to ensure that the routing system within the network remains stable, and 
ensuring that peak level traffic flows sit comfortably within the bandwidth and switching 
capabilities of the network. Unfortunately, some of these seminal principles are ignored in favor of 
maximizing revenue potential. For example, it is not uncommon for an ISP (Internet Service 
Provider) to oversubscribe an access aggregation point by a factor of 25 to 1, especially in the case 
of calculating the number of subscribers compared to the number of available modem ports, nor is it 
uncommon to see interprovider exchange points experiencing peak packet drop rates in excess of 
20% of all transit traffic. In a single quality best effort environment, the practice of oversubscription 
allows the introduction of additional load into the network at the expense of marginal degradation 
to all existing active subscribers. This can be a very dangerous practice, and if miscalculated, can 
result in seriously degraded service performance (due to induced congestion) for all subscribers. 
Therefore, oversubscription should not be done arbitrarily. 
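
The arithmetic behind an oversubscription decision is simple, and a short sketch shows how sensitive the outcome is to the assumed peak activity. The subscriber counts and activity fractions below are invented for illustration, and the function names are hypothetical; only the 25 to 1 ratio comes from the text.

```python
def oversubscription_ratio(subscribers, ports):
    """Subscribers per available access port."""
    return subscribers / ports


def peak_port_shortfall(subscribers, ports, peak_active_fraction):
    """Expected number of subscribers unable to connect at the peak.

    peak_active_fraction is the assumed share of subscribers trying to
    connect at the busiest moment; it must come from measurement, not guesswork.
    """
    demand = subscribers * peak_active_fraction
    return max(0.0, demand - ports)


# 2,500 subscribers sharing 100 modem ports is a 25 to 1 oversubscription.
print(oversubscription_ratio(2500, 100))
# If 3% are active at the peak, 75 ports suffice; at 5%, demand exceeds supply.
print(peak_port_shortfall(2500, 100, 0.03))
print(peak_port_shortfall(2500, 100, 0.05))
```

A two percentage point error in the activity estimate is the difference between spare capacity and a quarter of peak callers receiving a busy signal, which is why the ratio should be set from observed usage.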





Network engineering is arguably a compromise between engineering capabilities for the average 
load levels, and engineering capabilities which are intended to handle peak load conditions. In the 
case of dimensioning access ports, close attention must be paid to user traffic characteristics and 
modem pool port usage, based on time-of-day and day-of-week, for an extended period prior to 
making such an engineering commitment. Even after such traffic analysis has been undertaken, the 
deployed configuration should be closely monitored on a continuing and consistent basis to detect 
changes in usage and characteristics. It is very easy to assume that only a fraction of subscribers 
may be active at any given time, or to assume average and peak usage rates, but without close 
observation and prior traffic sampling, a haphazard assumption could dramatically affect the 
survival of your business. 

Equally, it is necessary for the network operator to understand the nature, size, and timing of traffic 
flows that are carried across the network’s transmission systems. While a critical component of 
traffic analysis is the monitoring of the capacity on individual transmission links, monitoring the 
dispersion of end-to-end traffic flows allows the network operator to ensure that the designed 
transmission topology provides efficient carriage for the data traffic. It also helps to avoid the 
situation where major traffic flows are routed sub-optimally across multiple hops, incurring 
additional cost and potentially imposing a performance penalty through transit across 
additional routing points. 

The same can be said of maintaining stability in the ISP’s network routing system. Failure to create 
a highly stable routing system can result in destinations being intermittently unreachable, and 
ultimately in frustrated customers. Care should be taken on all similar infrastructure and "critical" 
service issues, such as DNS (Domain Name System) services. The expertise of the engineering and 
support staff will be reflected in the service quality of the network, like it or not. 

Having said that, it is not difficult to understand that poorly designed networks do not lend 
themselves to QoS scenarios, due to the fact that if acceptable levels of service quality cannot be 
maintained, then it is quite likely that adding QoS in an effort to create some level of service 
differentiation will never be effective. Granted, it may allow the network performance to degrade 
more "gracefully" in times of severe congestion for some applications operated within the group of 
elevated QoS customers, but limiting the impact of degradation for some, at the cost of increasing 
the impact for the remainder of the customer base, is not the most ingenious or sensible use of QoS 
mechanisms. The introduction of QoS differentiation into the network is only partially effective if 
those customers who do not subscribe to a QoS service are adversely impacted. After all, if 
non-QoS subscribers are negatively impacted, they will seek other service providers for their 
connectivity, or they will be forced to subscribe to the QoS service to obtain an acceptable service 
level. This last sentence requires a bit of unconventional logic, since not all subscribers can 
realistically be QoS subscribers - this violates some of the most fundamental QoS strategies. 

The design principles which are necessary to support effective QoS mechanisms can be expressed 
in terms of the four base service quality parameters noted in the previous section — delay, jitter, 
bandwidth, and reliability. In order to minimize delay, the network must be based upon a 
transmission topology which reflects the pattern of end-to-end traffic flows, and a routing system 
design which attempts to localize traffic such that minimal distance paths are always preferred. In 
order to minimize jitter, the routing system must be held in as stable a state as possible. Router 
queue depths must also be configured so that they remain within the same order of magnitude in 
size as the delay bandwidth product of the transmission link that is fed by the queue. Also, 
unconditional preferential queuing mechanisms should be avoided in favor of weighting or similar 
fair access queue mechanisms, to ensure that no class of traffic is delayed indefinitely while 
awaiting access to the transmission resources. Selection of maximum transmission unit (MTU) sizes 
should also be undertaken to avoid MTUs which are very much greater than the delay bandwidth 
product of the link -- again as a means of minimizing levels of network-induced jitter. 
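
The queue depth and MTU guidance above reduces to a small calculation. The following is a rough sketch, assuming the link speed is given in bits per second and the link delay in seconds; the function names and the example link figures are illustrative only.

```python
def delay_bandwidth_product_bytes(link_bps, delay_s):
    """Bytes that fit 'on the wire' of a link: bandwidth multiplied by delay."""
    return link_bps * delay_s / 8


def queue_depth_packets(link_bps, delay_s, packet_bytes):
    """Queue depth, in packets, of the same order as the delay bandwidth product."""
    product = delay_bandwidth_product_bytes(link_bps, delay_s)
    return max(1, round(product / packet_bytes))


def serialization_delay_s(packet_bytes, link_bps):
    """Time taken to clock a single packet onto the link."""
    return packet_bytes * 8 / link_bps


# A 2.048 Mbit/s link with 125 ms of delay holds 32,000 bytes in transit,
# suggesting a queue of about 32 packets of 1,000 bytes each.
print(delay_bandwidth_product_bytes(2_048_000, 0.125))  # 32000.0
print(queue_depth_packets(2_048_000, 0.125, 1000))      # 32

# On a 64 kbit/s link one 1,500-byte packet occupies the line for 0.1875 s,
# delaying every packet queued behind it by that amount.
print(serialization_delay_s(1500, 64_000))              # 0.1875
```

The last figure shows why a large MTU on a slow link is a source of jitter: a short packet that arrives just behind a full-sized one waits far longer than one that finds the link idle.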

In terms of overall reliability, the onus is on the network architect to use transmission media which 
have a very low intrinsic bit error rate, and to use router components which have high levels of 
availability and stability. Care should also be taken to ensure that the routing system is configured 
to provide deterministic outcomes, minimizing the risk of packet reordering. Transmission capacity, 
or bandwidth, should be engineered to minimize the level of congestion-induced packet loss within 
the routers. This is perhaps not so straightforward as it sounds, given that transmission capacity is 
one of the major cost elements for an Internet network service provider, and the network architect 
typically has to assess the trade-off between the cost performance of the network, and the duration 
and impact of peak load conditions on the network. Typically, the network architect looks at
average line utilization and "busy hour" to "average hour" utilization ratios to ensure acceptable
levels of economic performance, while looking at busy hour performance figures to ensure that the
network does not degenerate into a condition of congestion collapse at the points in time where
usage is at a maximum.
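
The utilization figures mentioned above can be derived from ordinary hourly samples. The sketch below is a minimal illustration; the sample values are invented for the example, and the function name is hypothetical.

```python
def utilization_summary(hourly_utilization_pct):
    """Return (average, busy hour, busy-to-average ratio) for one day of samples."""
    average = sum(hourly_utilization_pct) / len(hourly_utilization_pct)
    busy_hour = max(hourly_utilization_pct)
    return average, busy_hour, busy_hour / average


# Invented line utilization samples (percent), one per hour of the day.
samples = [30] * 12 + [45] * 8 + [50] * 3 + [90]

average, busy_hour, ratio = utilization_summary(samples)
print(average, busy_hour, ratio)
```

A high ratio with a modest average suggests a link that is economical most of the day but at risk during the busy hour.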

It is only after these basic design steps have been undertaken, and a basic level of service quality 
achieved within the network, that the issue of QoS (or service level differentiation) can be 
examined in any productive manner. The general conclusion here is that you cannot introduce QoS 
mechanisms to salvage a network which is delivering very poor levels of service. In order to be 
effective, QoS mechanisms need to be implemented in a network that is soundly engineered and 
which operates in a stable fashion under all levels of offered load. 

3 QoS tools 

Having provided a framework definition for QoS, we now turn to the mechanisms (and
architectural implementations) which can provide differentiation for traffic in the network. We
break these mechanisms into three basic groups, which align with the lower three layers of the OSI
reference model: the Physical, Link, and Network layers [Figure 1].


Figure 1: The OSI reference model (Application, Presentation, Session, Transport, Network, Link,
Physical) alongside the TCP/IP protocol suite (Application, Transport, Network, Link), with TCP and
UDP at the transport layer, IP routing at the network layer, and Frame Relay and ATM at the link
layer.

3.1 Physical layer mechanics 


The physical layer (also referred to as L1, or Layer 1) consists of the physical wiring, fiber optics,
and other transmission media in the network itself. It is reasonable to ask how Layer 1 physical
media figure within the QoS framework, but the time-honored practice of constructing diverse
physical paths in a network is, perhaps ironically, a primitive method of providing differentiated
service levels. In some cases, diverse paths are constructed primarily so that network layer routing
can provide redundant availability should the primary physical path fail, although the temptation to
share the load across the primary and backup paths is often overwhelming. Load sharing of this kind
can degrade performance: with more than one physical path to a destination, some arbitrary amount
of network traffic takes the primary low-delay, high-bandwidth path, while the balance of the traffic
takes a backup path which may have different delay and bandwidth properties. Such a configuration
leads to reduced reliability and increased jitter within the network, unless the routing profile has
been carefully constructed to stabilize the traffic segmentation between the two paths.

3.1.1 Alternate physical paths 

While diverse physical paths in a network are usually provisioned to provide backup and
redundancy, they can also be used to provide differentiated services if the available paths have
differing characteristics. In Figure 2, for example, best-effort traffic could be forwarded by the
network layer devices (routers) along the lower-speed path, while higher priority (QoS) traffic could
be forwarded along the higher-speed path.


Figure 2: A higher-speed priority path and a lower-speed best-effort path between two routers.

Alternatively, the scenario could be a satellite path complemented by a faster terrestrial cable path. 
Best-effort traffic would be passed along the higher-delay satellite path, while priority traffic would
be routed along the terrestrial cable system. 
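
The satellite and terrestrial scenario can be sketched as a simple class-based path choice. This is an illustration only: the `Path` type, the `select_path` function, the class names, and the delay figures are all hypothetical, and a real router would express such a policy in its routing configuration rather than in code like this.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Path:
    name: str
    one_way_delay_ms: int


def select_path(traffic_class: str, paths: List[Path]) -> Path:
    """Send priority traffic over the lowest-delay path, all else over the highest-delay path."""
    ordered = sorted(paths, key=lambda path: path.one_way_delay_ms)
    return ordered[0] if traffic_class == "priority" else ordered[-1]


# Illustrative delays only; real figures depend on the circuits in question.
PATHS = [Path("satellite", 270), Path("terrestrial-cable", 40)]

print(select_path("priority", PATHS).name)
print(select_path("best-effort", PATHS).name)
```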

Certainly, this type of approach is indeed primitive, and not without its pitfalls. 

Destination-Based Routing and Path Selection 

IP packets are forwarded in the Internet on the basis of the destination address contained
in the packet header. This is termed destination-based routing. Because packets are generally
switched based on a local decision of the best path to the IP destination address contained in the
IP packet header, the mechanisms which do exist
to forward traffic based on its IP source address are not very robust. Accordingly, it is difficult to
perform outbound path selection based on the characteristics of the traffic source. The default form 
of path selection is based on the identity of the receiver. This implies that a QoS differentiation 
mechanism using path selection would be most efficiently implemented on selection of incoming 
traffic, while outgoing traffic would adopt the QoS parameters of the receiver. A destination-based 
routed network cannot control the QoS paths of both incoming and outgoing traffic to any particular 
location. Each QoS path will be determined as a destination-based path selection, leading to the 
observation that in a heterogeneous QoS environment, asymmetric quality parameters on incoming 
and outgoing data flows will be observed. This is a significant issue for unidirectional UDP-based 
traffic flows, where it becomes the receiver who controls the quality level of the transmission, not 
the sender. It is also significant for TCP, where the data flow and the reverse ACK flows may take 
paths of differing quality. Given that the sender adapts its transmission rate via signaling which
transits the complete round trip, the resultant quality of the entire flow is influenced by the lower of
the two quality levels of the forward and reverse paths.
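
To make the destination-based forwarding decision concrete, the following sketch performs a longest-prefix match using only the destination address. The forwarding table, the interface names, and the `next_interface` function are hypothetical; the addresses are documentation prefixes. The point of the sketch is that the source address is accepted as an argument but plays no part in the outcome.

```python
import ipaddress

# Hypothetical forwarding table: destination prefix -> outgoing interface.
FORWARDING_TABLE = {
    ipaddress.ip_network("0.0.0.0/0"): "if-default",
    ipaddress.ip_network("192.0.2.0/24"): "if-low-delay",
    ipaddress.ip_network("198.51.100.0/24"): "if-satellite",
}


def next_interface(source: str, destination: str) -> str:
    """Longest-prefix match on the destination; the source address is ignored."""
    dst = ipaddress.ip_address(destination)
    matches = [net for net in FORWARDING_TABLE if dst in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return FORWARDING_TABLE[best]


# Two different senders reach the same receiver over the same interface.
print(next_interface("203.0.113.1", "192.0.2.10"))
print(next_interface("203.0.113.99", "192.0.2.10"))
```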

TCP and symmetric path selection 

Reliable traffic transmission (TCP) requires a bidirectional data flow — sessions which are initiated 
and established by a particular host (sender) generally require control traffic (e.g., explicit 
acknowledgments, or the notification that acknowledgments were not processed by the receiver) to 
be returned from the destination (receiver). This reverse data flow is used to determine the 
transmission success, out-of-order traffic reception at the receiver, transmission rate adaptation, or 
other maintenance and control signals, in order to operate correctly. In essence, this reverse flow 
allows the sender to infer what is happening along the forward path and at the receiver, allowing the 
sender to optimize the flow data rate to fully utilize its fair share of the forward path's resource
level. Therefore, for a reliable traffic flow which is transmitted along a particular path at a 
particular differentiated quality level, the flow will need to have its return traffic flow traverse the 
same path at the same quality level if optimal flow rates are to be reliably maintained (this routing 
characteristic is also known as symmetric paths). Asymmetric paths in the Internet continue to be a 
troubling issue with regard to traffic which is sensitive to induced delay and differing service 
quality levels, which effectively distort the signal being generated by the receiver. This problem is 
predominantly due to local routing policies in the individual administrative domains which
traffic in the Internet must traverse. It is therefore unrealistic to expect path symmetry in the
Internet, at least for the foreseeable future.
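
One way to see why the reverse path matters is the familiar bound that a sender with a fixed window cannot deliver more than one window of data per round trip. The sketch below applies that bound to a symmetric and an asymmetric pair of paths. The window size and delays are illustrative assumptions, and the bound ignores loss, queuing, and congestion control dynamics.

```python
def window_limited_throughput_bps(window_bytes: int,
                                  forward_delay_ms: int,
                                  reverse_delay_ms: int) -> float:
    """Upper bound on throughput: one window per round trip time."""
    rtt_ms = forward_delay_ms + reverse_delay_ms
    return window_bytes * 8 * 1000 / rtt_ms


# Data and ACKs both take a 20 ms path.
symmetric = window_limited_throughput_bps(64_000, 20, 20)

# Data takes the 20 ms path, but ACKs return over a 300 ms path.
asymmetric = window_limited_throughput_bps(64_000, 20, 300)

print(symmetric, asymmetric)
```

Under these assumed figures the forward path is unchanged, yet the achievable rate falls by a factor of eight because of the slower return path.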

The conclusion is that path diversity does allow for differentiated service levels to be constructed 
from the different delay, bandwidth, and load characteristics of the various paths. However, for 
reliable transmission applications, this differentiation is relatively crude. 

3.2 Link layer mechanics 

There exists a belief that traffic service differentiation can be provided with specific link layer 
mechanisms (also referred to as Layer 2, or L2), and traditionally this belief in differentiation has 
been associated with Asynchronous Transfer Mode (ATM) and Frame Relay in the wide area 
network (WAN), and predominantly with ATM in the local area network (LAN) campus. A brief 
overview is provided here of how each of these technologies provides service differentiation, and 
additionally, we provide an overview of the newer IEEE 802.1p mechanics which may be useful to
provide traffic differentiation on IEEE 802 style LAN media. 

3.2.1 ATM 

ATM is one of the few transmission technologies which provide data-transport speeds in excess of
155 Mbps today. As well as a high-speed bit-rate clock, ATM also provides a complex subset of 
traffic-management mechanisms, Virtual Circuit (VC) establishment controls, and various 
associated QoS parameters for these VCs. It is important to understand why these underlying 
transmission QoS mechanisms are not being exploited by a vast number of organizations that are
using ATM as a data-transport tool for Internet networks in the wide area. ATM is used in
today’s Internet networks predominantly because of the high data-clocking rate and
multiplexing flexibility available with ATM implementations. There are few other transmission
technologies which provide such a high-speed bit-rate clock.

However, it is useful to examine the ATM VC service characteristics and examine their potential 
applicability to the Internet environment. 

3.2.1.1 Constant bit rate (CBR) 

The ATM CBR service category is used for virtual circuits that transport traffic at a consistent bit 
rate, where there is an inherent reliance on time synchronization between the traffic source and 
destination. CBR is tailored for any type of data for which the end-systems require predictable 
response time and a static amount of bandwidth continuously available for the lifetime of the 
connection. The amount of bandwidth is characterized by a Peak Cell Rate (PCR). These 
applications include services such as video conferencing, telephony (voice services), or any type of 
on-demand service, such as interactive voice and audio. For telephony and native voice 
applications, AAL1 (ATM Adaptation Layer 1) and CBR service is best suited to provide 
low-latency traffic with predictable delivery characteristics. In the same vein, the CBR service 
category typically is used for circuit emulation. For multimedia applications, such as video, you 
might want to choose the CBR service category for a compressed, frame-based, streaming video 
format over AAL5 for the same reasons. 
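
As a worked example of how a Peak Cell Rate relates to a constant bit rate source, the sketch below sizes a CBR connection for a single 64 kbps voice channel carried over AAL1. It assumes the usual 53-byte cell with a 48-byte payload, of which AAL1 uses one byte for its own header; the function names are hypothetical.

```python
import math

ATM_CELL_BYTES = 53       # 5-byte cell header plus 48-byte payload
AAL1_PAYLOAD_BYTES = 47   # 48-byte cell payload minus 1 byte of AAL1 header


def cbr_peak_cell_rate(source_bps: int) -> int:
    """Cells per second needed to carry a constant bit rate source over AAL1."""
    return math.ceil(source_bps / (AAL1_PAYLOAD_BYTES * 8))


def line_rate_bps(cells_per_second: int) -> int:
    """Bandwidth consumed on the ATM link, including all cell overhead."""
    return cells_per_second * ATM_CELL_BYTES * 8


pcr = cbr_peak_cell_rate(64_000)
print(pcr, line_rate_bps(pcr))
```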

3.2.1.2 Real-time and non-real-time variable bit rate (rt- and nrt-VBR) 


The VBR service categories are generally used for any class of applications that might benefit from 
sending data at variable rates to most efficiently use network resources. Real-Time VBR (rt-VBR), 
for example, is used for multimedia applications with lossy properties, applications that can tolerate 
a small amount of cell loss without noticeably degrading the quality of the presentation. Some 
multimedia protocol formats may use a lossy compression scheme that provides these properties. 
Non-Real-Time VBR (nrt-VBR), on the other hand, is predominantly used for transaction-oriented 
applications, where traffic is sporadic and bursty. 

The rt-VBR service category is used for connections that transport traffic at variable rates — traffic 
that relies on accurate timing between the traffic source and destination. An example of traffic that
requires this type of service category is a variable-rate, compressed video stream. Sources that use
rt-VBR connections are expected to transmit at a rate that varies with time (e.g., traffic that can be
considered bursty). Real-time VBR connections can be characterized by a Peak Cell Rate (PCR),
Sustainable Cell Rate (SCR), and Maximum Burst Size (MBS). Cells delayed beyond the value
specified by the maximum CTD (Cell Transfer Delay) are assumed to be of significantly reduced
value to the application; delay is thus explicitly considered in the rt-VBR service category.
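
The text does not describe how conformance to PCR, SCR, and MBS is tested. One common formulation, taken here from the ATM Forum traffic management work rather than from this paper, is the virtual scheduling form of the Generic Cell Rate Algorithm (GCRA). The sketch below is a minimal version for the SCR and MBS pair, working in integer milliseconds; the class and function names and the traffic values are assumptions made for illustration.

```python
class Gcra:
    """Virtual scheduling form of the generic cell rate algorithm.

    increment: expected spacing between cells (1 / rate), in time ticks.
    limit: tolerance allowed for early arrivals, in the same ticks.
    """

    def __init__(self, increment: int, limit: int) -> None:
        self.increment = increment
        self.limit = limit
        self.tat = 0  # theoretical arrival time of the next cell

    def conforms(self, arrival: int) -> bool:
        if arrival < self.tat - self.limit:
            return False  # cell arrived too early: non-conforming, state unchanged
        self.tat = max(arrival, self.tat) + self.increment
        return True


def scr_policer(pcr: int, scr: int, mbs: int, ticks_per_second: int = 1000) -> Gcra:
    """Build a policer for SCR with burst tolerance (MBS - 1) * (1/SCR - 1/PCR).

    Assumes both rates divide ticks_per_second evenly.
    """
    t_pcr = ticks_per_second // pcr
    t_scr = ticks_per_second // scr
    return Gcra(increment=t_scr, limit=(mbs - 1) * (t_scr - t_pcr))


# A source bursting at PCR (one cell every 10 ms) against SCR = 10 cells/s, MBS = 5.
policer = scr_policer(pcr=100, scr=10, mbs=5)
results = [policer.conforms(t) for t in [0, 10, 20, 30, 40, 50]]
print(results)
```

With these assumed parameters the first five cells of the burst conform and the sixth does not, which matches the intent of a Maximum Burst Size of five cells.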

The nrt-VBR service category is used for connections that transport variable bit rate traffic for 
which there is no inherent reliance on time synchronization between the traffic source and 
destination, but there is a need for an attempt at a guaranteed bandwidth or latency. An application 
that might require an nrt-VBR service category is Frame Relay interworking, where the Frame 
Relay CIR (Committed Information Rate) is mapped to a bandwidth guarantee in the ATM 
network. No delay bounds are associated with nrt-VBR service. 


3.2.1.3 Available bit rate (ABR) 


The ABR service category is similar to nrt-VBR, because it also is used for connections that 
transport variable bit rate traffic for which there is no reliance on time synchronization between the 
traffic source and destination, and for which no required guarantees of bandwidth or latency exist. 
ABR provides a best-effort transport service, in which flow-control mechanisms are used to adjust 
the amount of bandwidth available to the traffic originator. The ABR service category is designed 
primarily for any type of traffic that is not time sensitive and expects no guarantees of service. ABR 
service generally is considered preferable for TCP/IP traffic, as well as for other LAN-based
protocols that can modify their transmission behavior in response to ABR’s rate-control mechanics.

ABR uses Resource Management (RM) cells to provide feedback that controls the traffic source in 
response to fluctuations in available resources within the interior ATM network. The specification 
for ABR flow control uses these RM cells to control the flow of cell traffic on ABR connections. 
The ABR service expects the end-system to adapt its traffic rate in accordance with the feedback so 
that it may obtain its fair share of available network resources. The goal of ABR service is to 
provide fast access to available network resources at up to the specified Peak Cell Rate (PCR). 
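
The feedback loop described above can be sketched as a rate adjustment made by the source each time an RM cell returns. The sketch is loosely modeled on the ABR source behavior in the ATM Forum traffic management work and is heavily simplified. The congestion indication (CI), no increase (NI), and explicit rate (ER) fields, the rate increase and decrease factors, the Minimum Cell Rate (MCR), and all numeric values are assumptions not taken from the text.

```python
def adjust_allowed_cell_rate(acr: float, ci: bool, ni: bool, er: float,
                             pcr: float = 1000.0, mcr: float = 100.0,
                             rif: float = 0.0625, rdf: float = 0.0625) -> float:
    """Return the new allowed cell rate after one returning RM cell."""
    if ci:
        acr -= acr * rdf     # congestion indicated: multiplicative decrease
    elif not ni:
        acr += rif * pcr     # no congestion and increase permitted: additive increase
    acr = min(acr, er, pcr)  # never exceed the explicit rate or the peak cell rate
    return max(acr, mcr)     # never fall below the minimum cell rate


rate = adjust_allowed_cell_rate(400.0, ci=False, ni=False, er=1000.0)
print(rate)
rate = adjust_allowed_cell_rate(rate, ci=True, ni=False, er=1000.0)
print(rate)
```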

3.2.1.4 Unspecified bit rate (UBR) 

The UBR service category also is similar to nrt-VBR, because it is used for connections that 
transport variable bit rate traffic for which there is no reliance on time synchronization between the 
traffic source and destination. However, unlike ABR, there are no flow-control mechanisms to 
dynamically adjust the amount of bandwidth available to the user. UBR generally is used for 
applications that are very tolerant of delay and cell loss. UBR has enjoyed success in Internet LAN 
and WAN environments for store-and-forward traffic, such as file transfers and e-mail. Similar to 
the way in which upper-layer protocols react to ABR’s traffic-control mechanisms, TCP/IP and 
other LAN-based traffic protocols can modify their transmission behavior in response to latency or 
cell loss in the ATM network. 

3.2.1.5 The misconceptions about ATM QoS 

Several observations must be made to realize the value of ATM QoS and its associated complexity. 
This section attempts to provide an objective overview of the problems associated with relying 
solely on ATM to provide QoS in the network. However, it sometimes is difficult to quantify the 
significance of some issues because of the complexity involved in the ATM QoS delivery 
mechanisms, and their interactions with higher-layer protocols and applications. In fact, the 
inherent complexity of ATM and its associated QoS mechanisms may be a big reason why many 
network operators are reluctant to implement those QoS mechanisms. 

While the underlying recovery mechanism of ATM cell loss is signaling, the QoS structures for
ATM VCs are excessively complex. When tested against the principle of Occam’s Razor (a
popular translation, used in the engineering community for years, is "All things being
equal, choose the solution that is simpler"), ATM by itself would not be the choice for QoS
services, simply because of the complexity involved compared to other technologies that provide
similar results. Having said that, the application of Occam’s Razor does not provide
assurances that the desired result will be delivered; it simply expresses a preference for
simplicity.

ATM enthusiasts correctly point out that ATM is complex for good reason — in order to provide 
predictive, proactive, and real-time services, such as dynamic network resource allocation, resource
guarantees, virtual circuit rerouting, and virtual circuit path establishment to accommodate
subscriber QoS requests, ATM’s complexity is unavoidable. The underlying model of ATM is a
heterogeneous client population where the real time service models assume simple clients which are 
highly intolerant of jitter, while the adaptive models assume very highly sophisticated clients which 
can opportunistically tune their data rates to variations in available capacity which may fluctuate 
greatly within time frames well inside normal end-to-end round trip times. 

It also has been observed that higher-layer protocols, such as TCP/IP, provide the end-to-end 
transportation service in most cases, so that although it is possible to create QoS services in a lower 
layer of the protocol stack (namely ATM in this case), such services may cover only part of the 
end-to-end data path. This gets to the heart of the problem in delivering QoS with ATM, when the 
true end-to-end bearer service is not pervasive ATM. Such partial QoS measures often have their 
effects masked by the effects of the traffic distortion created from the remainder of the end-to-end 
path in which they do not reside, and hence the overall outcome of a partial QoS structure often is 
ineffectual. In other words, if ATM is not pervasively deployed end-to-end in the data path, efforts 
to deliver QoS using ATM can be unproductive. There is traffic distortion introduced into the ATM 
landscape by traffic-forwarding devices which service the ATM network and upper-layer protocols 
such as IP, TCP, and UDP, as well as other upper-layer network protocols. Queuing and buffering 
introduced into the network by routers and non-ATM-attached hosts skew the accuracy with which 
the lower-layer ATM services calculate delay and delay variation (jitter). 

On a related note, some have suggested that most traffic on ATM networks would be primarily 
UBR or ABR connections, because higher-layer protocols and applications cannot request specific 
ATM QoS service classes, and therefore cannot fully exploit the QoS capabilities of the VBR 
service categories. A cursory examination of deployed ATM networks and their associated traffic 
profiles reveals that this is indeed the case, except in the rare instance when an academic or 
research organization has developed its own native "ATM-aware" applications that can fully exploit 
the QoS parameters available to the rt-VBR and nrt-VBR service categories. Although this certainly 
is possible, and has been done on several occasions, real-world experience reveals that this is the 
proverbial exception and not the rule. 

It is interesting to note the observations published by Jagannath and Yin [1], which suggest that "it 
is not sufficient to have a lossless ATM subnetwork from the end-to-end performance point of 
view," especially in the case of ABR services. This observation is due to the fact that two distinct 
control loops exist — ABR and TCP [Figure 3]. Although it generally is agreed that ABR can 
effectively control the congestion in the ATM network, ABR flow control simply pushes the 
congestion to the edges of the network (i.e., the routers), where performance degradation or packet 
loss may occur as a result. Jagannath and Yin also point out that "one may argue that the reduction 
in buffer requirements in the ATM switch by using ABR flow control may be at the expense of an 
increase in buffer requirements at the edge device (e.g., ATM router interface, legacy LAN to ATM 
switches)." Because most applications use the flow control provided by TCP, one might question 
the benefit of using ABR flow control at the subnetwork layer, because UBR (albeit with Early 
Packet Discard) is equally effective and much less complex. ABR flow control also may result in 
longer feedback delay for TCP control mechanisms, and this ultimately exacerbates the overall 
congestion problem in the network. 


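Figure 3: Host, Router, ATM Switch, ATM Switch, Router, Host along the end-to-end path, with the
ABR control loop spanning the ATM subnetwork and the TCP control loop spanning the two hosts.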

Aside from traditional data services that may use UBR, ABR, or VBR services, it is clear that 
circuit-emulation services which may be provisioned using the CBR service category can certainly 
provide the QoS necessary for telephony communications. However, this becomes an exercise in 
comparing apples and oranges. Delivering voice services on virtual digital circuits using circuit 
emulation is quite different from delivering packet-based data found in local-area and wide-area 
networks. Providing QoS in these two environments is substantially different - it is substantially 
more difficult to deliver QoS for data, because the higher-layer applications and protocols do not 
provide the necessary hooks to exploit the QoS mechanisms in the ATM network. As a result, an 
intervening router must make the QoS request on behalf of the application, and thus the ATM 
network really has no way of discerning what type of QoS the application may truly require. This 
particular deficiency has been the topic of recent research and development efforts to address this 
shortcoming and investigate methods of allowing the end-systems to request network resources 
using RSVP, and then map these requests to native ATM QoS service classes as appropriate. 

The QoS objective for networks similar in nature to the Internet lies principally in directing the 
network to alter the switching behavior at the IP layer, so that certain IP packets are delayed or
discarded at the onset of congestion in order to delay, or if at all possible completely avoid, the
impact of congestion on other classes of IP traffic. When looking at IP-over-ATM, the issue (as with
IP-over-Frame Relay) is that there is no mechanism for mapping such IP-level directives to the 
ATM level, nor is it desirable, given the small size of ATM cells and the consequent requirement 
for rapid processing or discard. Attempting to increase the complexity of the ATM cell discard 
mechanics to the extent necessary to preserve the original IP QoS directives by mapping them into 
the ATM cell is arguably counterproductive. 

Thus, it appears that the default IP QoS approach is best suited to IP-over-ATM. It also stands to 
reason that if the ATM network is adequately dimensioned to handle burst loads without the 
requirement to undertake large-scale congestion avoidance at the ATM layer, there is no need for 
the IP layer to invoke congestion-management mechanisms. Thus, the discussion comes full circle 
to an issue of capacity engineering, and not necessarily one of QoS within ATM. 

3.2.2 Frame relay 

Frame Relay’s origins lie in the development of ISDN (Integrated Services Digital Network) 
technology, where Frame Relay originally was seen as a packet-service technology for ISDN 
networks. The Frame Relay rationale proposed was the perceived need for the efficient relaying of 
HDLC framed data across ISDN networks. With the removal of data link-layer error detection, 
retransmission, and flow control, Frame Relay opted for end-to-end signaling at the transport layer
of the protocol stack to undertake these functions. This allows the network switches to consider 
data-link frames as being forwarded without waiting for positive acknowledgment from the next 
switch. This in turn allows the switches to operate with less memory and to drive faster circuits 
with the reduced switching functionality required by Frame Relay. 

3.2.2.1 Frame relay rate management control structures 

Frame Relay is a link layer protocol which attempts to provide a simple mechanism for arbitration 
of network oversubscription. Frame Relay decouples the characteristics of the network access link 
from the characteristics of the virtual circuits which connect the access system to its group peers. 
Each virtual circuit is configured with a committed information rate (CIR), which corresponds
to a commitment on the part of the network to provide traffic delivery. However, any virtual circuit
can also accept overflow traffic levels - bursts which may transmit up to the rate of the access link. 
Such excess traffic is marked by the network access gateway using a single bit indicated in the 
Frame Relay frame header, termed the "Discard Eligible" (DE) bit. 
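
The marking behavior at the access gateway can be sketched as a simple per-interval meter. This is a minimal illustration under stated assumptions: traffic is measured over a fixed one-second interval, every frame that takes the running total past the committed volume is marked, and the function name and traffic figures are hypothetical.

```python
def mark_discard_eligible(frame_sizes_bits, cir_bps, interval_s=1):
    """Return one DE flag per frame for a single measurement interval."""
    committed_bits = cir_bps * interval_s
    sent_bits = 0
    flags = []
    for size in frame_sizes_bits:
        sent_bits += size
        # Frames beyond the committed volume are accepted but marked DE.
        flags.append(sent_bits > committed_bits)
    return flags


# Six 16,000-bit frames offered in one second on a VC with a 64 kbps CIR.
de_flags = mark_discard_eligible([16_000] * 6, cir_bps=64_000)
print(de_flags)
```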

The interior of the network uses three basic levels of threshold to manage switch queue congestion. 
At the first level of queue threshold, the network starts to mark frames with Explicit Congestion 
Notification (ECN) bits. Frame relay congestion control is handled in two ways — congestion 
avoidance and congestion recovery. Congestion avoidance consists of a Backward Explicit 
Congestion Notification (BECN) bit and a Forward Explicit Congestion Notification (FECN) bit, 
both of which are also contained in the Frame Relay frame header. The BECN bit provides a 
mechanism for any switch in the frame relay network to notify the originating node (sender) of 
potential congestion when there is a build-up of queued traffic in the switch’s queues. This informs 
the sender that the transmission of additional traffic (frames) should be restricted. The FECN bit 
notifies the receiving node of potential future delays, informing the receiver to use possible 
mechanisms available in a higher-layer protocol to alert the transmitting node to restrict the flow of 
frames. The implied semantics of the congestion notification signaling is to notify senders and 
receivers to reduce their transmission rates to the CIR levels, although this action is not forced upon 
them. At the second level of queue threshold, the switch discards frames which are marked as DE,
honoring its commitment to traffic which conforms to committed information rates on each circuit. 

The basic premise within Frame Relay networks is that the switching fabric is dimensioned at such 
a level that it can fulfill its obligations of committed traffic flows. If this is not the case, and 
discarding of all discard eligible traffic fails to remove the condition which is causing switch 
congestion, the switch passes the third threshold of the queue, where it discards frames which form 
part of committed flow rates. 
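
The three queue thresholds can be summarized as a small decision function. The threshold values, the `Action` names, and the choice to keep signaling congestion on committed frames between the second and third thresholds are assumptions made for this sketch; actual switch behavior is implementation specific.

```python
from enum import Enum


class Action(Enum):
    FORWARD = "forward"
    FORWARD_WITH_ECN = "forward with FECN/BECN set"
    DISCARD = "discard"


def switch_action(queue_depth: int, de_marked: bool,
                  t1: int = 50, t2: int = 80, t3: int = 100) -> Action:
    """Decide what a switch does with a frame at the given queue depth."""
    if queue_depth >= t3:
        return Action.DISCARD  # third threshold: even committed frames are lost
    if queue_depth >= t2:
        # Second threshold: shed discard eligible frames, keep committed ones.
        return Action.DISCARD if de_marked else Action.FORWARD_WITH_ECN
    if queue_depth >= t1:
        return Action.FORWARD_WITH_ECN  # first threshold: signal congestion only
    return Action.FORWARD


print(switch_action(60, de_marked=True))
print(switch_action(90, de_marked=True))
print(switch_action(90, de_marked=False))
```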

Frame Relay allows a basic level of oversubscription of basic point-to-point virtual circuits, where 
individual flows can increase their transfer rate, with the intent of occupying otherwise idle 
transmission capacity which is not being utilized by other virtual circuits which share the same 
transmission paths. When the sender is not using all of the committed rate within any of the 
configured virtual circuits, other VCs can utilize the transmission space with discard eligible 
frames. 

Frame Relay demonstrates that it is possible to provide reasonable structures of basic service
commitment, together with the added capability of overcommitment, using a very
sparse link-level signaling set: the DE, FECN, and BECN bits.


3.2.2.2 Frame relay and Internet QoS 

Frame Relay is certainly a good example of what is possible with relatively sparse signaling
capability. However, the match between Frame Relay as a link layer protocol, and QoS mechanisms 
for the Internet, is not a particularly good one. 

Frame Relay networks operate within a locally defined context of using selective frame discard as a 
means of enforcing rate limits on traffic as it enters the network. This is done as the primary 
response to congestion. The basis of this selection is undertaken without respect to any hints 
provided by the higher-layer protocols. The end-to-end TCP protocol uses packet loss as the 
primary signaling mechanism to indicate network congestion, but it is recognized only by the TCP 
session originator. The result is that when the network starts to reach a congestion state, the method 
in which end-system applications are degraded matches no particular imposed policy, and in this 
current environment, Frame Relay offers no great advantage over any other link layer technology in 
addressing this. 

One can make the observation that in a heterogeneous network that uses a number of link layer 
technologies to support end-to-end data paths, the Frame Relay ECN and DE bits are not a panacea 
-- they do not provide for end-to-end signaling, and the router is not the system that manages either 
end of the end-to-end protocol stack. More commonly, the router simply encapsulates IP packets into
Frame Relay frames. With this in mind, a more functional approach to user selection of DE
traffic is possible, one that uses a field in the IP header to indicate a defined quality level via a
single discard eligibility field, and allows this designation to be carried end-to-end across the entire
network path. With this facility, it then is logical to allow the ingress IP router (which performs the 
encapsulation of an IP datagram into a Frame Relay frame) to set the DE bit according to the bit 
setting indicated in the IP header field, and then pass the frame to the first-hop Frame Relay switch, 
which then can confirm or clear the DE bit in accordance with locally configured policy associated 
with the per-VC CIR. 
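
The mapping proposed above can be sketched in two steps: the ingress router copies a discard eligibility indication from the IP header into the DE bit, and the first-hop switch then confirms or clears it. The DE position used here is that of the common two-octet Frame Relay address field. The choice of IP header bit, the function names, and the switch policy (clear DE while the VC is within its CIR) are hypothetical.

```python
FR_DE_MASK = 0x02       # DE bit in the second octet of a two-octet address field
IP_DISCARD_BIT = 0x01   # hypothetical: low-order bit of the IP TOS octet


def ingress_router_set_de(fr_header: bytes, ip_tos: int) -> bytes:
    """Set or clear DE to mirror the discard eligibility bit in the IP header."""
    second = fr_header[1]
    if ip_tos & IP_DISCARD_BIT:
        second |= FR_DE_MASK
    else:
        second &= ~FR_DE_MASK & 0xFF
    return bytes([fr_header[0], second])


def first_hop_switch(fr_header: bytes, vc_within_cir: bool) -> bytes:
    """Hypothetical local policy: clear DE while the VC is within its CIR."""
    if vc_within_cir:
        return bytes([fr_header[0], fr_header[1] & ~FR_DE_MASK & 0xFF])
    return fr_header  # VC is over its CIR: confirm the marking as received


header = b"\x18\x41"    # DLCI 100, with FECN, BECN, and DE all clear
marked = ingress_router_set_de(header, ip_tos=0x01)
print(marked.hex())
print(first_hop_switch(marked, vc_within_cir=True).hex())
```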

The seminal observation regarding the interaction of QoS mechanisms within various levels of the 
model of the protocol stack is that without coherence between the link layer transport signaling 
structures and the higher-level protocol stack, the result, in terms of consistency of service quality, 
is completely chaotic. 

3.2.3 IEEE 802.1p 

It should be noted that an interesting set of proposed enhancements is being reviewed by the IEEE 
802.1 Internetworking Task Group. These enhancements would provide a method to identify 
802-style frames based on a simple priority. A supplement to the original IEEE MAC Bridges 
standard [2], the proposed 802.1p specification [3] provides a method to allow preferential queuing
and access to media resources by traffic class, on the basis of a "priority" value signaled in the
frame. The IEEE 802.1p specification, if adopted, will provide a way to transport this value (called
user priority) across the subnetwork in a consistent method for Ethernet, token ring, or other 
MAC-layer media types using an extended frame format. Of course, this also implies that 
802.1p-compliant hardware may have to be deployed to fully realize these capabilities. 

The current 802.1p draft defines the user priority field as a 3-bit value, giving a range
of values from zero to seven, with seven indicating the highest relative priority and
zero indicating the lowest relative priority. The IEEE 802.1p proposal does not make any
suggestions on how the user priority should be used by the end-system or by network elements — it 
only suggests that packets may be queued by LAN devices based on their relative user priority 
values. 
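
As an illustration of a 3-bit user priority carried in an extended frame format, the sketch below packs and extracts the priority from a 16-bit tag control field and orders frames for transmission by that value. The layout assumed is that of the IEEE 802.1Q tag (3 bits of priority, 1 flag bit, 12 bits of VLAN identifier), which is not described in the text. The function names and the strict ordering are also assumptions, since the proposal leaves the queuing discipline open.

```python
def tag_control(priority: int, vlan_id: int) -> int:
    """Pack a 3-bit user priority and a 12-bit VLAN identifier into 16 bits."""
    if not 0 <= priority <= 7:
        raise ValueError("user priority must be between 0 and 7")
    return (priority << 13) | (vlan_id & 0x0FFF)


def user_priority(tag: int) -> int:
    """Extract the 3-bit user priority from a 16-bit tag control field."""
    return (tag >> 13) & 0x7


def transmit_order(frames):
    """Order (name, tag) pairs by priority, highest first, keeping arrival order within a priority."""
    return sorted(frames, key=lambda frame: -user_priority(frame[1]))


frames = [("bulk", tag_control(0, 100)),
          ("voice", tag_control(7, 100)),
          ("video", tag_control(5, 100))]
print([name for name, _ in transmit_order(frames)])
```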

While it is clear that the 802.1p user priority may indeed prove to be useful in some QoS
implementations, it remains to be seen how it will be most practically beneficial. At least one 
proposal exists [4] which suggests how the 802.1p user priority values may be used in conjunction
with the Subnet Bandwidth Manager (SBM), a proposal that allows LAN switches to participate in 
RSVP signaling and resource reservation objectives [5]. However, bearing in mind that the 
widespread deployment of RSVP in the global Internet is not wholly practical (as discussed in more 
detail below), it remains to be seen how QoS implementations will use this technology. 

3.3 Network and transport layer mechanics 

In the global Internet, it is undeniable that the common bearer service is the TCP/IP protocol suite; 
IP is therefore the common denominator. (The TCP/IP protocol suite is commonly referred 
to simply as IP - this has become the networking vernacular used to describe IP, as well as ICMP, 
TCP, and UDP.) This conclusion is supported by several lines of reasoning. The common 
denominator is chosen in the hope of using the most pervasive and ubiquitous protocol in the 
network, whether it be Layer 2 or Layer 3 (the network layer). Using the most pervasive protocol 
makes implementation, management, and troubleshooting much easier and yields a greater 
possibility of successfully providing a QoS implementation that actually works. 

It is also the case that this particular technology operates in an end-to-end fashion, using a signaling 
mechanism that spans the entire traversal of the network in a consistent fashion. IP is the end-to-end 
transportation service in most cases, so that although it is possible to create QoS services in 
substrate layers of the protocol stack, such services only cover part of the end-to-end data path. 

Such partial measures often have their effects masked by the effects of the signal distortion created 
from the remainder of the end-to-end path in which they are not present, or introduce other signal 
distortion effects, and as mentioned previously, the overall outcome of a partial QoS structure is 
generally ineffectual. 

When the end-to-end path does not consist of a single pervasive data-link layer, any effort to 
provide differentiation within a particular link-layer technology most likely will not provide the 
desired result. This is the case for several reasons. In the Internet, for example, an IP packet may 
traverse any number of heterogeneous link-layer paths, each of which may (or may not) possess 
characteristics that inherently support traffic differentiation. However, the 
packet also inevitably traverses links that cannot provide any type of differentiated services at the 
link layer, rendering an effort to provide QoS solely at the link layer an inadequate solution. 

It should also be noted that the Internet today carries three basic categories of traffic, and any QoS 
environment must recognize and adjust itself to these three categories. The first category is 
long-held adaptive reliable traffic flows, where the end-to-end flow rate is altered by the end points 
in response to network behavior, and where the flow rate attempts to optimize itself in an effort to 
obtain a fair share of the available resources on the end-to-end path. Typically, this category of 
traffic performs optimally as long-held TCP traffic flows. The second category of traffic is a 
boundary case of the first: short-duration reliable transactions, where the flows are so short 
that rate adaptation is not established within the lifetime of the flow, and the flow sits 
entirely within the startup phase of the TCP adaptive flow control protocol. 
The third category is externally controlled load, unidirectional traffic, typically the result of 
compressing a real-time audio or video signal. Here the peak flow rate may equal the basic source 
signal rate, the average flow rate is a byproduct of the level of signal compression used, and the 
transport is an unreliable UDP unicast flow. Within most Internet networks today, empirical 
evidence indicates that the first category of traffic accounts for less than 1% of all packets, but 
as the data packets are typically large, this application accounts for some 20% of the volume of 
data. The second category of traffic 
is most commonly generated by World Wide Web servers using the HTTP/1.0 application protocol, 
and this traffic accounts for some 60% of all packets and a comparable share of the volume of 
data carried. The third category accounts for some 10% of all packets, and as the average packet 
size is less than one-third of the first two flow types, it currently accounts for some 5% of the total 
data volume. 

In order to provide elevated service quality to these three common traffic flow types, three 
different engineering approaches must be used. Efficient carriage of long-held, high-volume 
TCP flows requires the network to offer consistent signaling to the sender regarding the 
onset of congestion loss within the network. Efficient carriage of short-duration TCP 
traffic requires the network to avoid sending advance congestion signals to the flow end-points. 
Given that these flows are of short duration and low transfer rate, any such signaling will not 
achieve any appreciable load shedding, but will substantially increase the elapsed time that the flow 
is held active, which results in poor delivered service without any appreciable change in the relative 
allocation of network resources to service clients. Efficient carriage of externally 
clocked UDP traffic requires the network, at a minimum, to segment the queue 
management of such traffic from adaptive TCP traffic flows, and possibly to replace adaptation with 
advance notification and negotiation. Such a notification and negotiation model could allow the 
source to specify its traffic profile in advance, and have the network respond either with a 
commitment to carry such a load or with an indication that it does not have the resources available 
to meet such an additional commitment. 
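
The notification and negotiation model for externally clocked traffic can be pictured with a minimal 
sketch. It assumes the traffic profile is reduced to a single declared peak rate and that the 
network is a single link of fixed capacity; the class name `AdmissionController` and its methods are 
hypothetical and serve only to show the accept-or-refuse exchange. 

```python
class AdmissionController:
    """Hypothetical admission control for a single link (illustration only)."""

    def __init__(self, capacity_bps: int) -> None:
        self.capacity_bps = capacity_bps
        self.committed_bps = 0        # sum of the loads already accepted

    def request(self, peak_rate_bps: int) -> bool:
        """Commit to the declared load if it fits, otherwise refuse it."""
        if self.committed_bps + peak_rate_bps > self.capacity_bps:
            return False              # no resources for an additional commitment
        self.committed_bps += peak_rate_bps
        return True

    def release(self, peak_rate_bps: int) -> None:
        """Return a previously committed allocation."""
        self.committed_bps = max(0, self.committed_bps - peak_rate_bps)
```

A source that is refused can retry later, lower its declared rate, or send without any commitment. 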

As a consequence, it should be noted that no single transport or network layer mechanism will 
provide the capabilities for differentiated services for all flow types, and that a QoS network will 
deploy a number of mechanisms to meet the broad range of customer requirements in this area. 

3.3.1 TCP congestion avoidance 

As noted above, there are two major types of traffic flow in the Internet today. One is an 
adaptive-rate, reliable transmission, control-mediated traffic flow using TCP. The other is 
externally clocked, unreliable data flows which typically use UDP. Here, we look at QoS 
mechanisms for TCP, but to preface this it is necessary to briefly describe the behavior of TCP 
itself in terms of its rate control mechanisms. 

TCP uses a rate control mechanism to achieve a sustainable steady state network load. The intent of 
the rate control mechanism is to reach a state where the sender injects a new data packet into the 
network at the same time that the receiver removes a data packet. However, this is modified by a 
requirement to allow dynamic rate probing, so that the sender attempts to inject slightly more data 
than is being removed by the receiver, and when this rate probing results in congestion, the sender 
will reduce its rate and again probe upwards to find a stable operating point. This is accomplished 
in TCP using two basic mechanisms. 

The first mechanism used by TCP is slow start. A TCP connection commences with an initial 
congestion window (cwnd) of one, and a single segment is sent into the network. The sender then 
awaits the matching ACK from the receiver. Each time the sender receives an ACK, cwnd is 
increased by one segment size, which effectively doubles the transmission rate for each round trip 
time (RTT) cycle. When the first ACK is received, cwnd is opened from one to two, allowing two 
packets to be sent. As the ACK for each of these two segments is received, the congestion window is 
incremented again. The value of cwnd is then four, allowing four packets to be sent, and so on (receivers 
using delayed ACK will moderate this behavior such that the rate increase will be slightly less than 
doubled for each RTT). 
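
The doubling behavior can be reproduced with a few lines of code. The sketch below assumes that 
every segment is acknowledged individually (no delayed ACKs), that nothing is lost, and that no 
buffer limit is reached; the function name is illustrative only. 

```python
def slow_start_windows(rtts: int, initial_cwnd: int = 1) -> list:
    """Return cwnd (in segments) at the start of each RTT during slow start.

    Assumes one ACK per segment, no loss, and no sender or receiver buffer limit.
    """
    windows = []
    cwnd = initial_cwnd
    for _ in range(rtts):
        windows.append(cwnd)
        acks = cwnd            # one ACK returns for each segment sent this RTT
        cwnd += acks           # each ACK opens the window by one segment
    return windows
```

Under these assumptions the window grows as 1, 2, 4, 8, 16 over the first five round trips. 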

This behavior will continue until the transfer is completed, or the sender’s buffer space is 
exhausted, in which case the sender is transmitting at its maximal possible rate across the network’s 
selected path, given the delay bandwidth product of the path, or an intermediate router in the path 
experiences queue exhaustion, and packets are dropped. Given that the algorithm tends to cluster 
transmission events at epoch intervals of the RTT, such overload of the router queue structure is 
highly likely when the path delay is significant. 

The second TCP rate control mechanism is termed congestion avoidance. In the event of packet 
loss, as signaled by the reception of duplicate ACKs, the value of cwnd is halved, and this value is 
saved as the threshold value to terminate the slow start algorithm (ssthresh). When cwnd exceeds 
this threshold value, the window is increased in a linear fashion, opening the window by one 
segment size in each RTT interval. The value of cwnd is brought back to one when the end-to-end 
signaling collapses, and the sender times out on waiting for any ACK packets from the receiver. 
Since the value of cwnd is then below the ssthresh value, TCP also switches to slow start control mode, 
doubling the congestion window with every RTT interval. 
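
The interplay of the two mechanisms can be summarized as a small state model. The sketch below 
follows the simplified description given here rather than any particular TCP implementation: the 
window is counted in segments, it is updated once per RTT, and the initial ssthresh value is an 
arbitrary assumption. 

```python
class CongestionWindow:
    """Toy model of cwnd and ssthresh, counted in segments (illustration only)."""

    def __init__(self, ssthresh: float = 64.0) -> None:
        self.cwnd = 1.0
        self.ssthresh = ssthresh   # initial value is an assumption

    def on_rtt_without_loss(self) -> None:
        if self.cwnd < self.ssthresh:
            self.cwnd *= 2         # slow start: double per RTT
        else:
            self.cwnd += 1         # congestion avoidance: one segment per RTT

    def on_duplicate_acks(self) -> None:
        self.cwnd = max(self.cwnd / 2, 1.0)   # halve the window
        self.ssthresh = self.cwnd             # and save it as the threshold

    def on_timeout(self) -> None:
        self.cwnd = 1.0            # signaling collapsed: restart from one segment
```

After a timeout the window is below ssthresh, so the model doubles again until it reaches the 
saved threshold and then resumes linear growth. 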

It should be noted that the responsiveness of the TCP congestion avoidance algorithm is measured 
in intervals of the RTT, and the overall intent of the algorithm is to reach a steady state where the 
sender injects a new segment into the network at the same point in time when the receiver accepts a 
segment from the network. 

It should be noted that this algorithm works most efficiently when the spacing between ACK 
packets as received by the sender matches the spacing between data packets as they were received 
at the remote end. It should also be noted that the algorithm works optimally when 
congestion-induced packet loss happens just prior to complete queue exhaustion. The intent is to 
signal packet loss through duplicate ACK packets, allowing the sender to undertake a fast 
retransmission and leave the congestion window at the ssthresh level. If queue exhaustion occurs, 
the TCP session will stop receiving ACK signals and retransmission will only occur after timeout, 
and at that point the congestion window is brought back to a single segment, and the slow start 
algorithm is recommenced. An equally serious performance problem arises from network congestion 
events that occur within very short time intervals compared to the RTT (as may be the case in a mixed traffic 
load across a common ATM transport substrate, where short ATM switch cell queues may lead to 
very short interval cell loss events). In this case, a TCP session may switch from the aggressive 
window expansion of slow start into a much slower window expansion of congestion avoidance at 
a traffic rate well below the true long term network availability level. 

One major drawback in any high-volume IP network is that when there are congestion hot spots, 
uncontrolled congestion can wreak havoc on the overall performance of the network to the point of 
congestion collapse. When a high volume of TCP flows are active at the same time, and a 
congestion situation occurs within the network at a particular bottleneck, each flow conceivably 
could experience loss at approximately the same time, creating what is known as global 
synchronization. Global refers to all TCP flows in a given network that traverse a common path. 
Global synchronization occurs when hundreds or thousands (or perhaps hundreds of thousands) of 
flows back off their transmission rates and revert to TCP slow start mode at roughly the same time. 
Each TCP sender detects loss and reacts accordingly, going into slow-start mode, shrinking its 
transmission window size, pausing for a moment, and then attempting to retransmit the data once 
again. If the congestion situation still exists, each TCP sender detects loss once again, and the 
process repeats itself over and over again, resulting in a network form of gridlock [6]. 


Uncontrolled congestion is detrimental to the network -- behavior becomes unpredictable, system 
buffers fill up, packets ultimately are dropped, and the byproduct is a large number of retransmits 
which could ultimately result in complete congestion collapse. 

3.3.2 Preferential congestion avoidance at intermediate nodes 

Van Jacobson discussed the basic methods of implementing congestion avoidance in TCP in 1988 
[7]. However, Jacobson’s approach was more suited for a small number of TCP flows, which is 
much less complex to manage than the volume of active flows in the Internet today. In 1993, Sally 
Floyd and Van Jacobson documented the concept of RED (Random Early Detection), which 
provides a mechanism to avoid congestion collapse by randomly dropping packets from arbitrary 
flows in an effort to avoid the problem of global synchronization and, ultimately, congestion 
collapse [8]. The principal goal of RED is to avoid a "queue tail drop" situation in which all TCP 
flows experience congestion at the same time, and subsequent packet loss, thus avoiding global 
synchronization. RED also attempts to create TCP congestion signals using duplicate ACK 
signaling, rather than through sender timeout, which in turn produces a less catastrophic rate 
backoff by TCP. RED monitors the mean queue depth, and as the queue begins to fill, it begins to 
randomly select individual TCP flows from which to drop packets, in order to signal the sender to 
slow down [Figure 4]. The threshold at which RED begins to drop packets is generally configurable 
by the network administrator, as well as the rate at which drops occur in relation to how quickly the 
queue fills. The more it fills, the greater the number of flows selected, and the greater the number of 
packets dropped. This results in signaling a greater number of senders to slow down, thus resulting 
in a more manageable congestion avoidance. 
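
A minimal sketch of this behavior follows. The parameter names (`min_th`, `max_th`, `max_p`, 
`weight`) and their default values are assumptions chosen for illustration, and the sketch omits 
refinements such as spacing drops by the number of packets accepted since the last drop. 

```python
import random


class RedQueue:
    """Sketch of RED drop decisions; parameter names and values are assumptions."""

    def __init__(self, min_th=5.0, max_th=15.0, max_p=0.1, weight=0.2, rng=None):
        self.min_th, self.max_th, self.max_p = min_th, max_th, max_p
        self.weight = weight          # smoothing factor for the mean depth
        self.avg = 0.0                # running mean queue depth
        self.rng = rng or random.Random()

    def drop_probability(self) -> float:
        if self.avg < self.min_th:
            return 0.0                # mean depth below threshold: never drop
        if self.avg >= self.max_th:
            return 1.0                # mean depth at or above the upper limit: always drop
        span = self.max_th - self.min_th
        return self.max_p * (self.avg - self.min_th) / span

    def should_drop(self, current_depth: int) -> bool:
        """Update the mean with the current depth and decide on one arrival."""
        self.avg += self.weight * (current_depth - self.avg)
        return self.rng.random() < self.drop_probability()
```

Because the decision is made per arriving packet, flows that send more packets are selected more 
often, and the drop rate rises as the mean queue depth climbs between the two thresholds. 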



Figure 4 (axis label: queue depth) 

The RED approach does not possess the same undesirable overhead characteristics as some 
non-FIFO (First In, First Out) queuing techniques (e.g., simple priority queuing, class based 
queuing, weighted fair queuing). With RED, it is simply a matter of who gets into the queue in the 
first place — no packet reordering or queue management takes place. When packets are placed into 
the outbound queue, they are transmitted in the order in which they are queued. Queue-based 
scheduling mechanisms, such as priority, class-based, and weighted-fair queuing, however, require 
a significant amount of computational overhead due to packet reordering and queue management. 
RED requires much less overhead than these non-FIFO queuing mechanisms, but then again, RED 
performs a completely different function. 


RED can be said to be fair - it chooses random flows from which to discard traffic in an effort to 
avoid global synchronization and congestion collapse, as well as to maintain equity in which traffic 
actually is discarded. Fairness is all well and good, but what is really needed for differentiated QoS 
structures is a tool that can induce unfairness - a tool that can allow the network administrator to 
predetermine what traffic is dropped first (or last, as the case may be) when RED starts to select 
flows from which to discard packets. You can’t differentiate services with fairness. 

There are several proposals in the IETF which have suggested using the IP precedence subfield of 
the TOS (Type of Service) byte contained in the IP packet header to indicate the relative priority, or 
discard preference, of packets and to indicate how packets marked with these relative priorities 
should be treated within the network. As precedence is set or policed when traffic enters the 
network (at ingress), a weighted congestion avoidance mechanism implemented in the core routers 
determines which traffic should be discarded first when congestion is anticipated due to 
queue-depth capacity. The higher the precedence indicated in a packet, the lower the probability of 
discard. The lower the precedence, the higher the probability of discard. When the congestion 
avoidance is not actively discarding packets, all traffic is forwarded with equity. 
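
For reference, the precedence subfield occupies the three most significant bits of the TOS byte, so 
reading or rewriting it is a matter of simple bit manipulation. The helper names below are 
hypothetical. 

```python
def ip_precedence(tos_byte: int) -> int:
    """Return the 3-bit precedence carried in the top bits of the TOS byte."""
    if not 0 <= tos_byte <= 0xFF:
        raise ValueError("TOS is a single byte")
    return tos_byte >> 5


def with_precedence(tos_byte: int, precedence: int) -> int:
    """Return the TOS byte with its precedence subfield rewritten (e.g. at ingress)."""
    if not 0 <= precedence <= 7:
        raise ValueError("precedence must be in the range 0..7")
    return (precedence << 5) | (tos_byte & 0x1F)   # keep the low five bits
```

An ingress router that polices precedence would apply something like `with_precedence` before 
forwarding, and core routers would read the value back with `ip_precedence`. 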

Of course, for this type of operation to work properly, an intelligent congestion-control mechanism 
must be implemented on each router in the transit path. At least one currently deployed mechanism 
is available that provides an unfair, or weighted, behavior for RED. This variant of RED yields 
the desired result for differentiated traffic discard in times of congestion and is called Weighted 
Random Early Detection (WRED). A similar scheme, called enhanced RED, is documented in a 
paper authored by Feng, Kandlur, Saha, and Shin [9]. 
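
One way to picture a weighted variant of RED is to give each precedence level its own lower 
threshold, so that low-precedence traffic starts to be discarded at a shallower mean queue depth. 
The sketch below is an assumption about how such a weighting could be expressed; the threshold 
values are made up, and it does not describe any vendor's implementation. 

```python
def wred_drop_probability(avg_depth: float, precedence: int,
                          max_th: float = 40.0, max_p: float = 0.1) -> float:
    """Drop probability that falls as precedence rises (values are made up)."""
    if not 0 <= precedence <= 7:
        raise ValueError("precedence must be in the range 0..7")
    # Higher precedence gets a higher lower threshold, so it is dropped later.
    min_th = 10.0 + 3.0 * precedence
    if avg_depth < min_th:
        return 0.0
    if avg_depth >= max_th:
        return 1.0
    return max_p * (avg_depth - min_th) / (max_th - min_th)
```

At a mean depth of 20 packets, for example, precedence 0 traffic already faces a small drop 
probability while precedence 7 traffic is still forwarded untouched. 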

3.3.3 Scheduling algorithms to implement differentiated service 

There are other approaches to implement differential QoS within the network which use the router's 
queuing algorithm (or scheduling discipline) as the enabling mechanism. Considering that queuing 
is perhaps the optimal point to introduce QoS differentiation mechanisms, this is an area of 
considerable interest. 

FIFO Queuing 

The base of best effort, single quality, network environments is that of a FIFO queue, where there is 
no inherent differentiation undertaken by the router’s transmission scheduler. Every packet which is 
scheduled to be transmitted on an output interface must await all previously scheduled packets 
before transmission. All such packets occupy slots in a single per-interface queue, and when the 
queue fills, all subsequent packets are discarded until the queue becomes available once more. As 
with the basic RED algorithm, this is a fair algorithm, given that it allocates the transmission 
resource fairly and imposes the same delay on all queued packets. For differentiated service levels, 
it is necessary to alter this fairness and introduce mechanisms to trigger preferential outcomes for 
classes of traffic. 

The basic modification of the single level FIFO algorithm to enable differentiated QoS is to divide 
traffic into a number of categories, and then provide resources to each category in accordance with 
a predetermined allocation structure, implementing some form of proportional resource allocation. 

Of course, the Law of Conservation holds here, such that the sum of the mean queuing delays per 
traffic category, weighted by their share of the resources they receive, is limited, with the corollary 
that reducing the mean queuing delay for one category of traffic will result in an increase in 
mean queuing delay for one or more of the remaining categories of traffic [10]. Accordingly, you 
can’t improve the performance profile of one class of traffic without adversely affecting the 
performance profile of one or more of the other classes of traffic, and the level of degradation will 
be similar in quantity to the level of improvement that was effected. 
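
In its usual textbook form, and assuming a work-conserving scheduler with a fixed offered load, the 
conservation law can be written as follows. The symbols are introduced here for illustration: 
\rho_i is the share of the link used by traffic category i, W_i is the mean queuing delay of that 
category, N is the number of categories, and C is a value that does not depend on the scheduling 
discipline. 

```latex
\sum_{i=1}^{N} \rho_i \, W_i = C
\qquad\Longrightarrow\qquad
\rho_1 \, \Delta W_1 = -\sum_{i=2}^{N} \rho_i \, \Delta W_i
```

The second form states the trade-off directly: any reduction in the weighted delay of one category 
is matched by an equal increase spread across the others. 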

Priority Queues 

A basic modification to the base FIFO structure is to create a number of distinct queues for each 
interface and associate a relative priority level with each. Packets are scheduled from a particular 
priority queue in FIFO order only when all queues of a higher priority are empty. In such a model, 
the highest priority traffic receives minimal delay, but all other priority levels may experience 
resource starvation if the highest precedence traffic queue remains occupied. To ensure that all 
traffic receives some level of service, it is a requirement that the network use admission policies 
which restrict the amount of traffic which is admitted at each elevated priority, or that the 
scheduling algorithm is adjusted to ensure that every priority class receives some minimum level of 
resource allocation. Accordingly, this simple priority mechanism does not scale well, although it 
can be implemented with relatively little cost, and more sophisticated (and more robust) scheduling 
algorithms are required within the Internet for QoS support. 
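
A strict priority scheduler of this kind is short to express, which is part of its appeal. In the 
sketch below, index 0 is treated as the highest priority; that convention and the class name are 
choices made for the example. 

```python
from collections import deque


class PriorityScheduler:
    """Strict priority scheduling over per-priority FIFO queues (sketch)."""

    def __init__(self, levels: int) -> None:
        # Index 0 is treated as the highest priority in this sketch.
        self.queues = [deque() for _ in range(levels)]

    def enqueue(self, priority: int, packet) -> None:
        self.queues[priority].append(packet)

    def dequeue(self):
        """Return the next packet to transmit, or None if all queues are empty."""
        for queue in self.queues:          # scan from highest priority down
            if queue:
                return queue.popleft()     # FIFO order within a priority level
        return None
```

The starvation problem is visible in `dequeue`: a lower queue is never reached while any higher 
queue holds a packet. 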

Generalized Processor Sharing 

The ideal approach is to associate a relative weight (or precedence) with each individual traffic flow 
and at every router, segment each traffic flow into an individual FIFO queue, and configure the 
scheduler to service all queues in a bit-wise round-robin fashion, allocating service to each flow in 
accordance with the relative weight. This is an instance of a Generalized Processor Sharing (GPS) 
discipline [11 - pp. 234-236]. 

Weighted Round-Robin and Deficit Weighted Round-Robin 

Various scheduling techniques can approximate this model. A basic approach is to use a packet’s 
marked precedence to place the packet into a precedence-based queue, and then use a weighted 
round-robin scheduling algorithm to service each queue. If all packets are identically sized, this is a 
relatively good approximation of GPS, but when packet sizes vary, this algorithm can exhibit 
significant deviation from a strict relative resource allocation strategy. This can be partially 
addressed using a deficit weighted round-robin [11 - pp. 237-238] algorithm which modifies the 
round-robin algorithm to use a service quantum unit. A packet is scheduled from the head of a 
weighted queue only if the packet size minus the per-queue deficit counter is less than the weighted 
quantum value, and the next packet in the queue is tested using a weighted quantum value which 
has been reduced by the size of the scheduled packet. When the test fails, the remaining weighted 
quantum size is added to the per-queue deficit counter, and the scheduler moves to the next queue. 
This algorithm performs with an average allocation which corresponds to the relative weights of 
each queue, but still exhibits unfairness within time frames which are commensurate to the 
maximum packet service time. 
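
The sketch below shows deficit round-robin in its commonly described form, in which each queue 
accumulates a per-round quantum as credit and sends packets while the credit covers the packet at 
the head of the queue. The bookkeeping is arranged slightly differently from the description above, 
and the function name and argument layout are assumptions made for the example. 

```python
from collections import deque


def deficit_round_robin(queues, quanta, rounds):
    """Serve packet-size queues with deficit round-robin; return the send order.

    queues: list of deques of packet sizes in bytes (modified in place).
    quanta: per-queue quantum in bytes, proportional to the queue's weight.
    Returns a list of (queue_index, packet_size) in transmission order.
    """
    deficits = [0] * len(queues)
    sent = []
    for _ in range(rounds):
        for i, queue in enumerate(queues):
            if not queue:
                deficits[i] = 0            # idle queues do not bank credit
                continue
            deficits[i] += quanta[i]       # add this round's quantum
            while queue and queue[0] <= deficits[i]:
                size = queue.popleft()
                deficits[i] -= size        # pay for the packet just sent
                sent.append((i, size))
            if not queue:
                deficits[i] = 0            # queue drained: discard leftover credit
    return sent
```

With quanta of 500 and 1000 bytes, a queue of 300-byte packets and a queue of 1000-byte packets 
receive service in roughly a 1:2 byte ratio over several rounds, although within a single round the 
split can deviate by up to one packet. 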

Weighted Fair Queuing 

Weighted Fair Queuing (WFQ) [12] attempts to provide fairer resource allocation measures which 
protect "well behaved" traffic sources from uncontrolled sources. WFQ attempts to compute the 
finish time that each queued packet would have had if a bit-wise weighted GPS scheduler had been 
used, and then schedules for service the packet that would have had the smallest finish time 
under the corresponding GPS scheduler model. WFQ is both a scheduling and packet drop 
policy, where packet drop is based on a preference for dropping packets with the greatest finish 
time in response to an incoming packet which requires a queue slot. While WFQ requires a 
relatively complex implementation, it has a number of desirable properties. The scheduling 
algorithm does undertake fair allocation which does indeed ensure that different categories of traffic 
are not capable of resource starving other categories. The algorithm also bounds the queue delay 
per service category, which also provides a lever to create delay-bounded services without the need 
for resource reservation. 
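
The finish-time idea can be shown under a strong simplifying assumption: every flow is backlogged 
from time zero, so a packet's finish tag is just the previous tag of its flow plus its size divided 
by the flow's weight. A full WFQ implementation also has to track the progress of the reference GPS 
system as flows come and go, which this sketch does not attempt. The function name and data layout 
are illustrative. 

```python
import heapq


def wfq_order(packets, weights):
    """Order packets by GPS-style finish tags, assuming all flows are
    backlogged from time zero (no virtual clock tracking).

    packets: list of (flow_id, size) in per-flow arrival order.
    weights: dict mapping flow_id to its relative weight.
    Returns the list of (flow_id, size) in the order they would be served.
    """
    last_finish = {}                       # finish tag of each flow's latest packet
    heap = []
    for seq, (flow, size) in enumerate(packets):
        start = last_finish.get(flow, 0.0)
        finish = start + size / weights[flow]
        last_finish[flow] = finish
        heapq.heappush(heap, (finish, seq, flow, size))   # seq breaks ties
    order = []
    while heap:
        _, _, flow, size = heapq.heappop(heap)
        order.append((flow, size))
    return order
```

A flow with twice the weight accumulates finish tags half as quickly, so its packets are 
interleaved ahead of those from a lighter flow with equal packet sizes. 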

3.3.4 Traffic shaping non-adaptive flows 

One of the more confounding aspects of providing differentiated services at the network and 
transport layers of the TCP/IP protocol suite is that of dealing with non-adaptive flows, or in other 
words, traffic flows which do not adapt their transmission rates in response to loss in the network. 
The most offensive of this category appear to be applications which use UDP as their transport 
protocol. This is somewhat ironic in that long-standing traditional thinking has assumed that 
applications which use UDP are designed to be "intelligent enough" to 
recognize loss, since UDP does not provide any error correction itself. The resulting observation is 
that this is a fundamentally flawed assumption, since it is generally recognized that applications 
which use UDP are not very "network friendly." 

Having said that, the subsequent action is to rate-shape UDP flows as they enter the network, 
limiting their transmission rate to a specified bit rate. This method is arguably a compromise - it’s 
not pretty, but we understand how to do this, and it works. There are a couple of proposed methods 
to enhance the basic RED mechanism to provide some relief in the face of non-adaptive flows; 
however, the validity and practicality of these schemes are still being evaluated. One such proposal 
is discussed in [13]. 
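
Rate limiting at the network edge is often built on a token bucket, which is used below as one 
plausible way to hold a flow to a specified bit rate; the text does not say which mechanism is 
intended, so this choice is an assumption. Time is passed in explicitly to keep the sketch 
deterministic, and a real shaper would delay non-conforming packets rather than simply reporting 
them. 

```python
class TokenBucket:
    """Token bucket rate limiter; time is passed in explicitly for clarity."""

    def __init__(self, rate_bps: float, burst_bits: float) -> None:
        self.rate_bps = rate_bps       # sustained rate limit
        self.burst_bits = burst_bits   # bucket depth (largest permitted burst)
        self.tokens = burst_bits       # the bucket starts full
        self.last_time = 0.0

    def allow(self, packet_bits: int, now: float) -> bool:
        """Return True if the packet conforms at time `now` (in seconds)."""
        elapsed = now - self.last_time
        self.last_time = now
        self.tokens = min(self.burst_bits, self.tokens + elapsed * self.rate_bps)
        if packet_bits <= self.tokens:
            self.tokens -= packet_bits
            return True
        return False                   # non-conforming: drop or delay it
```

The sustained rate is set by `rate_bps`, while `burst_bits` bounds how far a source can exceed that 
rate over a short interval. 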

3.4 Integrated services and RSVP 

It has been suggested that the Integrated Services architecture [14] and RSVP [15] are excessively 
complex and possess poor scaling properties. This suggestion is undoubtedly prompted by the 
existence of the underlying complexity of the IP layer signaling requirements. However, it also can 
be suggested that RSVP is no more complex than some of the more advanced routing protocols. An 
alternative viewpoint might suggest that the underlying complexity is required because of the 
inherent difficulty in establishing and maintaining path and reservation state information along the 
transit path of data traffic. The suggestion that RSVP has poor scaling properties deserves 
additional examination, however, because deployment of RSVP has not been widespread enough to 
determine the scope of this assumption. 

As discussed in [16], there are several areas of concern about the wide-scale deployment of RSVP. 
With regard to concerns of RSVP scalability, the resource requirements (computational processing 
and memory consumption) for implementing RSVP on routers increase in direct proportion to the 
number of separate RSVP reservations, or sessions, accommodated. Therefore, supporting a large 
number of RSVP reservations could introduce a significant negative impact on router performance. 
By the same token, router-forwarding performance may be impacted adversely by the packet 
classification and scheduling mechanisms intended to provide differentiated services for reserved 
flows. These scaling concerns tend to suggest that organizations with large, high-speed networks 
will be reluctant to deploy RSVP in the foreseeable future, at least until these concerns are 
addressed. The underlying implications of this concern also suggest that without deployment by 
Internet service providers, who own and maintain the high-speed backbone networks in the Internet, 
the deployment of pervasive RSVP services in the Internet will not be forthcoming. 


Another important concern expressed in [16] deals with policy-control issues and RSVP. Policy 
control addresses the issue of who is authorized to make reservations and encompasses provisions 
to support access control and accounting. Although the current RSVP specification defines a 
mechanism for transporting policy information, it does not define the policies themselves, because 
the policy object is treated as an opaque element. Some vendors have indicated that they will use 
this policy object to provide proprietary mechanisms for policy control. At the time of this writing, 
however, the IETF has chartered a new working group, called the RSVP Admission Policy (rap) 
working group [17], to develop a simple policy-control mechanism to be used in conjunction with 
RSVP. There is ongoing work on this issue in this working group. Several mechanisms already 
have been proposed to deal with policy issues; however, it is unclear at this time whether any of 
these proposals will be implemented or adopted as a standard. 

The key recommendation contained in [16] is that given the current form of the RSVP 
specification, multimedia applications run within smaller, private networks are the most likely to 
benefit from the deployment of RSVP. The inadequacies of RSVP scaling, and lack of policy 
control, may be more manageable within the confines of a smaller, more controlled network 
environment than in the expanse of the global Internet. It certainly is possible that RSVP may 
provide genuine value and find legitimate deployment utility in smaller networks, both in the 
peripheral Internet networks and in the private arena, where these issues of scale are far less critical. 
Therein lies the key to successfully delivering QoS using RSVP. After all, the purpose of the 
Integrated Services architecture and RSVP is to provide a method to offer quality of service, not to 
degrade the service quality. 

4.0 Conclusions 

A number of dichotomies exist within the Internet that tend to dominate efforts to engineer possible 
solutions to the quality of service requirement. Thus far, QoS has been viewed as a wide-ranging 
solution set against a very broad problem area. This fact often can be considered a liability. 

Ongoing efforts to provide "perfect" solutions have illustrated that attempts to solve all possible 
problems result in technologies that are far too complex, have poor scaling properties, or simply do 
not integrate well into the diversity of the Internet. By the same token, close examination of 
the issues and technologies available reveals some very clever mechanisms. 
Determining the usefulness of these mechanisms, however, is perhaps the most 
challenging aspect in assessing the merit of any particular QoS approach. 

In the global Internet, however, it becomes an issue of implementing QoS within the most common 
denominator (clearly the TCP/IP protocol suite), because a single link-layer medium will 
never be used pervasively end-to-end across all possible paths. What about the suggestion that it is 
certainly possible to construct a smaller network of a pervasive link-layer technology, such as 
ATM? Although this is certainly possible in smaller private networks, and perhaps in smaller 
peripheral networks in the Internet, it is rarely the case that all end-systems are ATM-attached, and 
this does not appear to be a likely outcome in the coming years. In terms of implementing visibly 
differentiated services based on a quality metric, using ATM only on parts of the end-to-end path is 
not a viable answer. The ATM subpath is not aware of the complete network layer path, and it does 
not participate in the network or transport layer protocol end-to-end signaling. 

The simplistic answer to this conundrum is to dispense with TCP/IP and run native cell-based 
applications from ATM-attached end-systems. This is certainly not a realistic approach in the 
Internet, though, and chances are that it is not very realistic in a smaller corporate network, either. 
Very little application support exists for native ATM. Of course, in theory, the same could have 
been said of Frame Relay transport technologies in the recent past, and undoubtedly will be claimed 
of forthcoming transport technologies in the future. In general, working with link layer technologies is similar to 
viewing the world through plumber’s glasses: every communications issue is seen in terms of 
point-to-point bit pipes. Each wave of transport technology attempts to add more features to the 
shape of the pipe, but the underlying architecture is a constant perception of the communications 
world as a set of one-on-one conversations, with each conversation supported by a form of singular 
communications channel. 

One of the major enduring aspects of the communications industry is that no such thing as a 
ubiquitous single link layer technology exists. Hence, there is an enduring need for an 
internetworking end-to-end transport technology that can straddle a heterogeneous link layer 
substrate. Equally, there is a need for an internetworking technology that can allow differing 
models of communications, including fragmentary transfer, unidirectional data movement, 
multicast traffic, and adaptive data flow management. 

This is not to say that ATM itself, or any other link layer technology for that matter, is not an
appropriate technology to install into a network. Certainly, ATM offers high-speed transport
services, as well as the convenience of virtual circuits. The more important point, however, is that
no particular link layer technology is effective at providing QoS in the Internet, for the reasons
discussed thus far.

To quote a work in progress from the Internet Research Task Force, "The advantages of [the 
Internet Protocol’s] connectionless design, flexibility and robustness, have been amply 
demonstrated. However, these advantages are not without cost — careful design is required to 
provide good service under heavy load." [18]. Careful design is not exclusively the domain of the 
end-system’s protocol stack, although good end-system stacks are of significant benefit. Careful 
design also includes consideration of the mechanisms within the routers that are intended to avoid 
congestion collapse. Differentiation of services places further demands on this design, because in 
attempting to allocate additional resources to certain classes of traffic, it is essential to ensure that 
the use of resources remains efficient, and that no class of traffic is totally starved of resources to 
the extent that it suffers throughput and efficiency collapse. 

For QoS to be functional, it appears to be necessary that all the nodes in a given path behave in a 
similar fashion with respect to QoS parameters, or at the very least, do not impose additional QoS 
penalties other than conventional best effort into the end-to-end traffic environment. The sender (or 
network ingress point) must be able to create some form of signal associated with the data that can 
be used by downstream routers to potentially modify their default outbound interface selection, 
queuing behavior, and/or discard behavior. 

The insidious issue here is attempting to exert "control at a distance." The objective in this QoS 
methodology is for an end-system to generate a packet that can trigger a differentiated handling of 
the packet by each node in the traffic path, so that the end-to-end behavior exhibits performance 
levels in line with the end-user’s expectations and perhaps even a contracted fee structure. 

This control-at-a-distance model can take the form of a "guarantee" between the user and the
network. Under this guarantee, if the ingress traffic conforms to a certain profile, the egress
traffic maintains that profile state, and the network does not distort the desired characteristics
of the end-to-end traffic expected by the requestor. To provide such absolute guarantees, the
network must maintain a transitive state along a determined path, where the first router commits
resources to honor the traffic profile and passes this commitment along to a neighboring router that
is closer to the nominated destination and also capable of committing to honor the same traffic
profile. This is done on a hop-by-hop basis along the transit path between the sender and receiver,
and yet again from receiver back to sender. This type of state maintenance is viable within
small-scale networks, but in the heart of large-scale public networks such as the global Internet, the
cost of state maintenance is overwhelming. Because this is the mode of operation of RSVP, RSVP
faces serious scaling limits and is inappropriate for deployment in large networks.

RSVP scaling considerations present another important point, however. RSVP’s deployment
constraints are not limited simply to the amount of resources it might consume on each network
node as per-flow state maintenance is performed. It is easy to understand that as the number of
discrete flows in the network increases, so does the amount of resources RSVP consumes. Of course,
this can be limited somewhat by defining how much of the network’s resources are available to
RSVP; everything in excess of this value is treated as best-effort. What is more subtle, however, is
that when all available RSVP resources are consumed, all further requests for QoS are rejected until
RSVP-allocated resources are released. This is similar to how the telephone system works, where
the network’s response to a flow request is either commitment or denial. Such a service is not a
viable way to operate a data network in which better-than-best-effort services arguably should
always be available.
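
The commit-or-deny behavior described above can be illustrated with a minimal sketch. It is not RSVP itself: the class name, the units and the idea of a single reservable fraction per link are assumptions made purely for illustration.

```python
class ReservationPool:
    """Toy admission control: a fixed share of link capacity is reservable."""

    def __init__(self, link_kbps, reservable_fraction):
        # Hypothetical parameters: total link rate and the share open to reservations.
        self.reservable_kbps = link_kbps * reservable_fraction
        self.flows = {}  # flow id -> reserved rate in kbit/s

    def request(self, flow_id, rate_kbps):
        """Commit the reservation if it fits, otherwise deny it."""
        if sum(self.flows.values()) + rate_kbps > self.reservable_kbps:
            return False  # denied: the flow is left with best effort
        self.flows[flow_id] = rate_kbps
        return True

    def release(self, flow_id):
        """Free the resources held by a flow, if it holds any."""
        self.flows.pop(flow_id, None)


pool = ReservationPool(link_kbps=1000, reservable_fraction=0.5)
print(pool.request("a", 300), pool.request("b", 200), pool.request("c", 1))
```

Once the reservable share is exhausted, a new request is denied no matter how small it is, until an existing flow releases its reservation.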

The alternative to state maintenance and resource reservation schemes is the use of mechanisms for
preferential allocation of resources, essentially creating varying levels of best-effort. Given the
absence of end-to-end guarantees of traffic flows, this removes the requirement for absolute state
maintenance, so that "better-than-best-effort" traffic with classes of distinction can be constructed
inside larger networks. Currently, the most promising direction for such better-than-best-effort
systems appears to lie within the area of modifying the network layer queuing and discard
algorithms. These mechanisms rely on an attribute value within the IP packet’s header, so that
queuing and discard preferences can be applied at each intermediate node. The ISP’s routers
must be configured to handle packets based on their IP precedence level, or similar semantics
expressed by the bit values defined in the IP packet header. There are three aspects to this. First,
the IP precedence field can be used to determine the queuing behavior of the router, both in
queuing the packet to the forwarding process and in queuing the packet to the output interface.
Second, the IP precedence field can be used to bias the packet discard processes by selecting the
lowest precedence packets to discard first. Third, any priority scheme used at Layer 2 should be
mapped to a particular IP precedence value.
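
As an illustration of the first two aspects, the following sketch shows a bounded output queue that serves the highest precedence first and, when full, discards the lowest precedence first. The class and its strict-priority service order are assumptions made for the example; a deployed router would more likely use a weighted scheduler so that low-precedence traffic is not starved.

```python
from collections import deque


class PrecedenceQueue:
    """Bounded queue: serve highest IP precedence first, drop lowest first."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.queues = {p: deque() for p in range(8)}  # IP precedence 0..7

    def __len__(self):
        return sum(len(q) for q in self.queues.values())

    def enqueue(self, precedence, packet):
        """Queue a packet; return whichever packet was discarded, or None."""
        dropped = None
        if len(self) >= self.capacity:
            lowest = min(p for p, q in self.queues.items() if q)
            if lowest >= precedence:
                return packet  # the arrival is itself the lowest (or tied): drop it
            dropped = self.queues[lowest].pop()  # evict newest packet of lowest class
        self.queues[precedence].append(packet)
        return dropped

    def dequeue(self):
        """Return the oldest packet of the highest non-empty precedence."""
        for p in range(7, -1, -1):
            if self.queues[p]:
                return self.queues[p].popleft()
        return None
```

With a capacity of two packets, queuing precedence 0 and then precedence 5 fills the queue; a later precedence 3 arrival evicts the precedence 0 packet, while a further precedence 0 arrival is itself discarded.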

Several methods have been proposed within the IETF which may yield robust mechanisms and 
semantics for providing these types of differential services ( diffserv ) [19]. 

The generic diffserv deployment environment assumes that the network uses ingress traffic
policing, where traffic entering the network is passed through traffic shaping profile
mechanisms that bound its average and peak data rates and set its relative priority and discard
precedence in accordance with the traffic profile and the administrative agreement with the
customer. These ingress filters can be configured either to discard out-of-profile packets or to
mark them with an elevated discard priority, so that they are carried within the network only when
adequate resources are available. Within the core of the network, WFQ (or similar proportional
scheduling algorithms) can be used to allocate network resources according to the marked priority
levels, allowing the high speed and high volume switching component of the network to operate
without per-flow state being imposed.
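
A minimal sketch of the marking variant of ingress policing follows, assuming a single-rate token bucket. The class name, its parameters and the "in"/"out" labels are hypothetical and are not taken from any specific diffserv proposal.

```python
class TokenBucketMarker:
    """Single-rate token bucket that marks, rather than drops, excess packets."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s   # contracted average rate
        self.burst = burst_bytes       # largest burst accepted in profile
        self.tokens = burst_bytes      # the bucket starts full
        self.last = 0.0                # arrival time of the previous packet

    def mark(self, arrival_time_s, size_bytes):
        """Return 'in' for an in-profile packet, 'out' for an out-of-profile one."""
        elapsed = arrival_time_s - self.last
        self.last = arrival_time_s
        # Refill at the contracted rate, never beyond the burst size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if size_bytes <= self.tokens:
            self.tokens -= size_bytes
            return "in"
        return "out"  # would be carried with an elevated discard priority


marker = TokenBucketMarker(rate_bytes_per_s=1000, burst_bytes=1500)
print([marker.mark(t, 1500) for t in (0.0, 0.5, 2.0)])
```

Only the ingress node keeps this per-customer state; core routers act on the resulting mark alone, which is what lets the core operate without per-flow state.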

The cumulative behavior of such stateless, local-context algorithms and corresponding deployment
architectures can yield the capability of distinguished and predictable service levels, and hold the
promise of excellent scalability. You still can mix best-effort and "better-than-best-effort" nodes,
but all nodes in the latter class should conform to the entire selected QoS profile or a compatible
subset (an example of the principle that it is better to do nothing than to do damage).

In conclusion, QoS is possible in the Internet, but it does come at a price of compromise - there are 
no perfect solutions. In fact, one might suggest that expectations have not been appropriately 
managed, since guarantees are simply not possible in the Internet, at least not for the foreseeable 
future. What is possible, however, is delivering differentiated levels of best effort traffic in a 
manner which is predictable, fairly consistent, and which provides the ability to offer discriminated 
service levels to different customers and to different applications. 

5.0 References 

[1] IETF Internet Draft, "End-to-End Traffic Issues in IP/ATM Internetworks," 
draft-jagan-e2e-traf-mgmt-00.txt, S. Jagannath, S. Yin, August 1997. 

[2] "MAC Bridges," ISO/IEC 10038, ANSI/IEEE Std. 802.1D, 1993.

[3] "Supplement to MAC Bridges: Traffic Class Expediting and Dynamic Multicast Filtering," 
IEEE P802.1p/D6, May 1997. 

[4] IETF Internet Draft, "Integrated Service Mappings on IEEE 802 Networks," 
draft-ietf-issll-is802-svc-mapping-01.txt, M. Seaman, A. Smith, E. Crawley, November 1997. 

[5] IETF Internet Draft, "SBM (Subnet Bandwidth Manager): A Proposal for Admission Control 
over IEEE 802-style Networks," draft-ietf-issll-is802-bm-05.txt, R. Yavatkar, F. Baker, D. 
Hoffman, Y. Bernet, November 1997.

[6] "Oscillating Behavior of Network Traffic: A Case Study Simulation," L. Zhang, D. Clark, 
Internetwork: Research and Experience, Volume 1, Number 2, John Wiley & Sons, 1990, pages
101-112.

[7] "Congestion Avoidance and Control," V. Jacobson, Computer Communication Review, vol. 18, 
no. 4, pp. 314-329, Aug. 1988. 

[8] "Random Early Detection Gateways for Congestion Avoidance," S. Floyd, V. Jacobson, 
IEEE/ACM Transactions on Networking, v.1, n.4, August 1993.

[9] "Understanding TCP Dynamics in an Integrated Services Internet," W. Feng, D. Kandlur, D. 
Saha, K. Shin, NOSSDAV ’97, May 1997. 

[10] "Queuing Systems, Volume 2: Computer Applications," L. Kleinrock, Wiley Interscience, 
1975. 

[11] "An Engineering Approach to Computer Networking," S. Keshav, Addison-Wesley, 1997. 

[12] "Design and Analysis of a Fair Queuing Algorithm," A. Demers, S. Keshav, and S. Shenker,
ACM SIGCOMM’89, Austin, September 1989. 

[13] "Dynamics of Random Early Detection," D. Lin and R. Morris (Harvard University), a 
proposal & overview of Fair Random Early Drop (FRED). Presented at ACM SIGCOMM 1997. 

[14] RFC1633, "Integrated Services in the Internet Architecture: An Overview," R. Braden, D.
Clark, S. Shenker, June 1994.


[15] RFC2205, "Resource ReSerVation Protocol (RSVP) - Version 1 Functional Specification," R. 
Braden, L. Zhang, S. Berson, S. Herzog, S. Jamin, September 1997. 

[16] RFC2208, "Resource ReSerVation Protocol (RSVP) Version 1 Applicability Statement, Some 
Guidelines on Deployment," A. Mankin, F. Baker, R. Braden, S. Bradner, M. O’Dell, A. Romanow, 
A. Weinrib, L. Zhang, September 1997. 

[17] The IETF RSVP Admission Policy (rap) working group charter can be found at: 
http://www.ietf.org/html.charters/rap-charter.html 

[18] Internet Research Task Force draft, "Recommendations on Queue Management and 
Congestion Avoidance in the Internet," draft-irtf-e2e-queue-mgt-recs.ps, R. Braden, D. Clark, J. 
Crowcroft, B. Davie, D. Estrin, S. Floyd, V. Jacobson, G. Minshall, C. Partridge, L. Peterson, K. 
Ramakrishnan, S. Shenker, J. Wroclawski, L. Zhang, March 1997. 

[19] Differential Service for the Internet, http://diffserv.lcs.mit.edu/ 

Acknowledgments 

Portions of text in this paper have been extracted from Quality of Service: Delivering QoS on the 
Internet and in Corporate Networks, by Paul Ferguson and Geoff Huston, published by John Wiley 
& Sons, January 1998, ISBN 0-471-24358-2. 





MARSHNet: A Multiservice Asynchronous Transfer Mode 

Wide-Area Network 


Shaun W Amy and James D Argyros 

CSIRO Telecommunications and Industrial Physics 
PO Box 76, Epping, 2121, Australia 

Shaun.Amy@tip.CSIRO.AU
Jim.Argyros@tip.CSIRO.AU

2 August, 1998 


Abstract 

For a number of years, Asynchronous Transfer Mode (ATM) networks have held the 
promise of being able to provide high-bandwidth integration of data, video and voice 
services with a guaranteed Quality of Service (QoS). Until recently, most networks which 
have implemented ATM have done so for the sole purpose of providing high bandwidth 
carriage of TCP/IP datagrams with little concern for QoS requirements. 

In this paper, we describe the design, implementation and operation of a high-bandwidth
wide-area network using ATM to satisfy both production and research networking
requirements. This network, known as MARSHNet, connects four CSIRO sites in Sydney via
34 Mbit/s (E3) microwave links.





Introduction 


CSIRO has belonged to the Internet community from the earliest days of the Internet in 
Australia. It was left to local professionals at each of the sites to arrange for connectivity and 
the range of services to be offered. This led to an ad hoc approach to networking with some 
sites demanding and getting better services than other sites. This is reflected in the different 
ways the various sites involved in the project were connected to the Internet. 

The wide area network to be replaced connected four CSIRO sites, all located within 
easy reach of each other, into the NSW Regional Network Organisation (RNO) located at 
the University of Technology (UTS). A 34Mbit/s microwave link using ATM has provided 
CSIRO’s connection (located at the CSIRO Riverside Corporate Park, North Ryde) to the 
NSW RNO since late 1995. The other sites then were connected via the Riverside Park 
router into the NSW RNO and hence the Internet. The CSIRO divisions on the Macquarie 
University (MQU) campus were connected via the CSIRO National Measurement Laboratory 
(NML) at Lindfield to the Riverside Park router using 10 Mbit/s microwave ethernet. The 
CSIRO Radiophysics Laboratory (RPL) was connected to the Riverside Park router via a 
2 Mbit/s synchronous serial service. 


[Figure 1: network diagram not reproduced. Labels recoverable from the original: CSIRO TIP/ATNF
(RPL); CSIRO TIP/DMT (NML); CSIRO Riverside Park (North Ryde); CSIRO MIS at Macquarie University
(MQU); 2 Mbit/s Megalink service; 10+2 Mbit/s microwave links between NML, Riverside Park and MQU;
microwave link to the NSW RNO (UTS); Cisco AGS+/4 router; Ethernet.]


Figure 1: The pre-MARSHNet CSIRO Wide-Area Network 


The proposal to upgrade the network developed from the conjunction of the following 
different influences: 

1. Some monies becoming available from the CSIRO corporate purse. 

2. Approaching end-of-life and increasing unreliability of the 10 Mbit/s microwave network 
components. 

3. A re-organisation within CSIRO which led to the Division of Applied Physics (located
at NML, Lindfield) and the Division of Radiophysics (located at RPL, Marsfield)
becoming a single division (CSIRO Telecommunications and Industrial Physics) with a
consequential desire to increase the “closeness” of the two sites.

4. The existence of a network research group at RPL with an interest in QoS issues
associated with ATM.

Initial studies into the feasibility of providing these services began in late-August 1996. 
The project became known as MARSHNet, the Marsfield Area Research and Scientific 
High-bandwidth Network. 

The Proposal 

A two-fold need was identified for the data network: in addition to the provision of a
high-bandwidth and reliable production network environment between the four sites, an ability
for various telecommunications research groups to conduct experiments between the sites was 
paramount. At the Radiophysics Laboratory, various research projects on ATM had been 
undertaken and collaborations existed with groups located at CSIRO Macquarie University 
and the University of Technology, Sydney. 

The following requirements needed to be satisfied by the MARSHNet project: 

1. The lifetime of the project should be a minimum of 3 years before any significant 
upgrades are required. 

2. Voice services within the CSIRO divisions involved in the project must be carried across 
the new infrastructure replacing, in the case of RPL, the 5 ISDN SPCs that connected 
RPL to CSIRO Riverside Park. 

3. Where possible, recurrent costs should be minimised, although clearly equipment needs 
to be maintained and any on-going licenses need to be funded. 

4. The minimum bandwidth was to be 34 Mbit/s (E3) and it was to be an ATM network. 

5. The network must support both production and research work. The research requirements
will dictate the standards for the ATM switches and the way the virtual circuit
infrastructure is configured.

6. Neither the production nor the research traffic should gain enhanced performance
at the expense of the other.

7. For maximum flexibility, the ability for CSIRO to set up and clear down virtual circuit
calls across the ATM infrastructure (end-to-end) with the appropriate ATM service
class is highly desirable.

8. Clearly defined service level agreements should be in place to ensure a speedy
restoration of service in the event of a link failure. Similarly, backup links should be
clearly defined and able to be brought up with minimal configuration issues and preferably
with no manual intervention.





The Physical and Logical Networks 

The options that were available for the physical network were somewhat limited: a private 
microwave service or a commercially available network service (whether that be “dark fibre”, a 
high-bandwidth “pipe” or commercial ATM service) offered by a telecommunications carrier. 
If a commercial offering were to be chosen, then it was made clear by the vendor that the 
service would need to be provisioned on optical fibre. Although a fibre offering is an attractive 
concept, the costs, particularly the recurrent costs, were prohibitive. Consequently, a private 
microwave service became the basis of this network. 

The radio system that was chosen for the microwave links operates in the 38 GHz band
and is a combined 34 Mbit/s + 2 Mbit/s (E3 + E1) system. This frequency was chosen to avoid
interference with existing radio research projects at RPL and because of its substantially
lower recurrent spectrum licensing costs. The combined system has the advantage that the E1
channel can be used for intelligent PABX internetworking without the additional complexity of
implementing 2 Mbit/s circuit emulation service over an ATM infrastructure. The data network
consists of ATM switches and routers at each of the four sites.

Figure 2 shows the physical network on completion of the project. It is a star topology 
centred at NML. For maximum redundancy we would have preferred a fully meshed network 
but the contours of the land and cost prevented this. To provide a measure of redundancy for 
data services the Megalink between RPL and Riverside Park has been maintained. It is not 
totally satisfactory in that not all sites receive equal benefit from this additional link. Voice
services will re-route to the public network in the event of a microwave link failure, although
any intelligent PABX networking features are not available in this case.

It will be noticed that the option exists at each site to operate independently of the 
routers in an ATM only mode. While it was required that the routers implement CSIRO 
policy (integrity of routing tables and provision of packet-filtering using access-control lists), 
it was also felt to be important to allow the use of any equipment that required a pure 
ATM environment. This has been used already to provide a high-quality video conferencing 
capability between Marsfield and Lindfield using commercially available video codecs which 
convert a composite video signal directly into ATM cells. 

Two logical networks were implemented to co-exist on the physical network just described: 
the production network and the research network. Each of the logical networks became 
identified with separate virtual networks (Liu, Richards & Rogers, 1998) during configuration
of the switches and routers. It was intended that on a heavily loaded network each virtual 
network would be guaranteed a minimum bandwidth of approximately 16 Mbit/s. However 
if one virtual network is lightly loaded, then the other virtual network should be able to use 
the unused bandwidth if required. 

Note however that by defining separate virtual networks in this way, when traffic needs to 
be switched within any one of the virtual networks, it requires its own virtual switch which 
at some point has to be identified with a physical switch. 

[Figure 2: network diagram not reproduced. Labels recoverable from the original: CSIRO TIP/ATNF
(RPL), with a connection to the CSIRO TIP ATM Research Laboratory; CSIRO TIP/DMT (NML); CSIRO
Riverside Park (North Ryde); CSIRO MIS at Macquarie University (MQU); 34 Mbit/s microwave links
from NML to RPL, Riverside Park and MQU, and from Riverside Park to the NSW RNO (UTS); Cisco
Catalyst 5500 (with ASP) and LS-1010 ATM switches; Cisco 7206 and 7507 routers; ATM E3 and STM-1
interfaces; E1 (PABX) DPNSS links.]

Figure 2: The MARSHNet CSIRO Wide-Area Network

The Virtual Path Networks

A virtual path (VP) represents a common route through a switch for a set of virtual circuits
(VC).

It becomes necessary to design VP network structures so that research traffic and production
traffic flow into their own VPs. Each of the VPs is assigned an allocation of bandwidth,
ensuring that each virtual network receives its share of bandwidth unaffected by the other
virtual network. Here we are attempting to use QoS features offered by ATM to partition the
bandwidth. To do this it is necessary that the traffic be translated into the ATM Constant Bit
Rate (CBR) service class. Then the CBR tag option is enabled on the virtual paths so that
cells from a virtual network that is using bandwidth beyond its allocated share are tagged,
resulting in a lower priority when congestion occurs. This mechanism allows each
virtual network to use bandwidth not being used by the other while ensuring each virtual
network can easily reclaim its bandwidth share by discarding the tagged cells.
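
The intended effect of tagging can be sketched as follows. This is a simplified model rather than the switch's actual algorithm: the function name, the per-interval accounting in whole cells and the proportional split of spare capacity among tagged cells are all assumptions.

```python
def carry(offered, share, capacity):
    """Cells carried per virtual network in one interval.

    offered:  {network: whole cells offered}
    share:    {network: whole cells guaranteed}, assumed to sum to <= capacity
    capacity: whole cells the link can carry in the interval
    """
    # Cells within the allocated share are untagged and always carried.
    carried = {n: min(offered[n], share[n]) for n in offered}
    # Cells beyond the share are tagged; they may only use what is left over.
    spare = max(0, capacity - sum(carried.values()))
    tagged = {n: offered[n] - carried[n] for n in offered}
    total_tagged = sum(tagged.values())
    if total_tagged > 0:
        for n in offered:
            # Assumption: spare capacity is split in proportion to tagged demand;
            # tagged cells that do not fit are discarded.
            carried[n] += min(tagged[n], spare * tagged[n] // total_tagged)
    return carried


shares = {"production": 16, "research": 16}
print(carry({"production": 30, "research": 4}, shares, 32))
print(carry({"production": 30, "research": 30}, shares, 32))
```

In the first call the lightly loaded research network leaves capacity that production borrows through tagged cells; in the second, both networks are held to their shares because every tagged cell is discarded.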

The physical MARSHNet topology is a star centred at NML. The production VP network 
has been configured to follow this star structure. The same structure cannot also be adopted 
for the research VP network because the virtual switch concept introduced in the previous 
section has an inherent weakness. Each virtual switch requires a separate routing table. 
The physical switches that have been used support only one routing table. Consequently it 
is not possible to route traffic within a virtual network for multiple virtual networks on a 
single physical switch. Because the research network at RPL uses ATM switches, the cleanest 
solution was to move the hub of the research virtual network into one of the research switches. 

The VP network structure is summarised in the following table: 


Table 1: The Allocation of ATM Virtual Paths within MARSHNet 


Site    Virtual Path    Network
RPL     1               Production
UTS     3               Production
MQU     5               Production
NML     2               Research
UTS     4               Research
MQU     6               Research


Because the VPs funnel the traffic to the hub for switching, there is no VP associated 
with the hub site. 

While this solved the problem for traffic to and from workstations at the RPL research 
centre, it does not solve it for traffic between workstations at the other sites where the extra 
path lengths in the research network ensure that the production paths would be chosen under 
normal conditions. The only way of guaranteeing use of the research network is to use 
permanent virtual circuits (PVC) or to deploy additional research switches at the other sites 
so that the VPs do not need to be terminated in the production switches.


The Production Network 

Each site connected to MARSHNet has an existing LAN. Routers at each of the sites provide 
the interface between the LANs and MARSHNet. It was required that all IP traffic destined 
for MARSHNet pass through a CSIRO State Hub router (in order to implement policy) 
located at Riverside Park. Through traffic is then switched to the correct destination across 
the ATM mesh. The link to the NSW RNO was moved from being directly connected to the 
router to an ATM switch, thus effectively “front-ending” the router with a switch. 

Figure 3 shows the production network emphasising the VP structure used. The most
obvious feature of this diagram is that each of the switches has an external loopback. The
reason that this is needed is that the switches used do not allow network traffic policing
to occur just after the switching function. By implementing the loopback, the VPs can be
policed in the same switch that is used to switch the VCs (Liu, Richards & Rogers, 1998).

Figure 3: The MARSHNet Production Network ATM Virtual Path Configuration 


The routers could be joined together in different ways. The method chosen was to use 
the ATM Forum’s LAN Emulation protocol, LANE (Dutton & Lenhard, 1995). This requires
running various LANE servers (e.g. LECS/LES/BUS) within the network, on one of the 
ATM connected devices (e.g. an ATM connected router or ATM switch). This introduces a 
single point of failure. Because of the star topology of the network, centred at NML, it was 
decided that providing these services in the ATM switch located at NML was appropriate. 
Experience showed a measure of redundancy was desirable and this was implemented using 
a vendor specific solution (Cisco Systems, 1997). 

At the time of the initial configuration only LANE version 1 had been standardised, which
uses the ATM Forum’s Unspecified Bit Rate (UBR) service category.


The Research Network 

Figure 4 shows the research network with its VP structure. Note especially how the VPs are
taken out of the production environment and terminated in a research switch. Once again,
the external loopbacks to allow traffic policing are prominent. The routers which are essential 
for the production network have been taken out of play for the research network. 

Figure 4: The MARSHNet Research Network ATM Virtual Path Configuration

It was never intended that traffic over the research network be limited to traffic which
fits within the LANE profile. Rather, it was intended that the research network make full
use of other traffic categories, in particular CBR and Variable Bit Rate (VBR). As both of
these traffic categories are given a higher priority through the ATM switches, experimental
traffic on the research network could easily block the production UBR traffic. This was
exactly the opposite of the effect that was needed, namely that, if required, the production
traffic should be able to invade the space of the research traffic.

It was for this reason that the VP structure described previously was developed. It is 
there that ATM QoS is being used to partition the bandwidth fairly and efficiently between 
the research network and the production network. 


IP over ATM 

To test the performance of IP over this network, tests were run using the Netperf benchmark
(Hewlett-Packard, 1995) to measure both TCP and UDP throughput. For the results reported
below, send and receive socket buffer sizes of approximately 16 kB were used. A problem was
observed immediately, which can be seen in Figure 5.

When the TCP/IP throughput was studied for various message sizes, the performance
achieved was reasonable considering protocol overheads (TCP/IP/LANE/AAL5/ATM/E3)
while the message size was small. As expected, a nearly linear increase was observed when the
message was smaller than the cell size. The surprise came when, shortly after messages had
to be fragmented into cells, the throughput collapsed. Analysis of the problem (Liu, Richards
& Rogers, 1998) has yielded the following explanation: the switch CBR output buffer size at
the E3 (34 Mbit/s) port, which is set by default to 256 cells, was too small to buffer bursts
from a source connected to the switch through an OC3 (155 Mbit/s) port. If the TCP/IP
maximum window size (64 kB) is used, a bottleneck buffer of more than 1066 cells would be
required. Multiple connections competing for buffer space at the bottleneck would indicate
that configuring a somewhat larger buffer would be prudent. The figure chosen was 40960
cells. After this change was made, an identical test was run and the results are shown in Figure 6.
The throughput with the larger buffer flattens at around 28 Mbit/s, which is acceptable.
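
A rough check of these figures is sketched below. It assumes that a full 64 kB window arrives as one burst at the OC3 rate while the E3 port drains it, and it ignores AAL5 and LANE overhead; the detailed analysis is in Liu, Richards & Rogers (1998) and may differ. The variable names are ours.

```python
import math

CELL_PAYLOAD_BYTES = 48    # payload bytes in a 53-byte ATM cell
CELL_BYTES = 53
WINDOW_BYTES = 64 * 1024   # maximum TCP window quoted in the text
IN_RATE = 155.0            # Mbit/s, OC3 port feeding the switch
OUT_RATE = 34.0            # Mbit/s, E3 bottleneck port

# Cells in one full window (assumption: no AAL5 or LANE overhead).
window_cells = WINDOW_BYTES / CELL_PAYLOAD_BYTES

# While the burst arrives at IN_RATE the port drains at OUT_RATE, so only
# the fraction (1 - OUT_RATE / IN_RATE) of the burst has to be buffered.
buffer_cells = window_cells * (1 - OUT_RATE / IN_RATE)

# Upper bound on payload throughput from the 48/53 cell overhead alone.
payload_ceiling = OUT_RATE * CELL_PAYLOAD_BYTES / CELL_BYTES

print(math.ceil(buffer_cells), round(payload_ceiling, 1))  # prints: 1066 30.8
```

Under these assumptions the estimate matches the 1066 cells quoted above, and the cell overhead alone caps payload throughput at just under 31 Mbit/s, so a measured plateau of around 28 Mbit/s is plausible once the remaining protocol layers are counted.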

Figure 5: TCP and UDP Performance measured using the Netperf Benchmark before the
switch buffers were increased (see text)

UDP/IP performance is also shown for each case and although slightly improved in the
second case, performance is still poor once the transmitted traffic exceeds the effective
bandwidth of the link. We suggest this is the expected behaviour for UDP traffic (see, for
example, Stevens (1994)).

It is worth noting that the ATM router interfaces do not support traffic shaping; thus the
TCP performance shown is reliant upon the protocol providing the flow control between the
transmitter and receiver. UDP does not offer such a mechanism.

Performance 

The following checks were carried out on the performance of the network. 

1. Using pairs of computers connected directly into the ATM network, it was shown that 
the VP associated with the production or research network was chosen correctly by 
SVC applications according to the production or research environment to which they 
belonged. 

2. It was demonstrated clearly (see Liu, Richards & Rogers (1998)) that the traffic inter¬ 
ference between the two virtual networks occurred as predicted by the network design. 

3. Prom the time that MARSHNet became operational (23 December 1997), there have 
been three occasions when the microwave link between Marsfield and Lindfield have 
failed. In all cases, the cut-over from the microwave link to the Megalink and the 
cut-back when the microwave service was restored happened automatically “with no 







Figure 6: TCP and UDP Performance measured using the Netperf Benchmark after the 
switch buffers were tuned (see text) 


manual intervention”, relying on the Layer 3 IP routing protocol to detect and react to 
link failure. 

4. The operation of IP over ATM has been demonstrated to perform at an acceptable
level for TCP once the buffer sizes of the E3 ports had been increased substantially.
UDP performance remains very poor. However, a growing number of traditionally
UDP-based applications (e.g. NFS) are converting to TCP, which makes this much less
of an issue.


Conclusions 

MARSHNet went into production just before Christmas 1997. It has performed reliably since
then, the only breaks in service being caused by hardware failures or very heavy rain. The
former were repaired quickly under the “service level agreement” in place. No networking
problems were attributable to configuration errors, which highlights the importance of bench
testing. Fortunately, we were encouraged to do this because it was recognized that some of
the design concepts were novel.

Apart from conventional data traffic on the production network, the research network 
regularly carries video and audio traffic in a video-conference scenario. The partitioning of
the physical network into two virtual networks has been shown to work well. 

While it is hoped that the infrastructure will last the three years originally planned,
the usage graphs and the future applications planned (e.g. remote backups) make
the existing bandwidth look barely adequate. Planning will have to begin soon for an upgrade









of at least part of the network to 155 Mbit/s.


Acknowledgements 

The authors would like to thank Glynn Rogers, Ren Liu, Antony Richards and Terry Percival 
for their various contributions to the project, sometimes in meetings, sometimes in discussions
and sometimes in network engineering. Financial support from CSIRO’s Information
Technology Services is also acknowledged. 


References 

Dutton, H.J.R., & Lenhard, P. 1995, Asynchronous Transfer Mode (ATM) Technical 
Overview, 2nd ed., Chapter 11, Prentice Hall PTR, Upper Saddle River, New Jersey 

Configure Fault-Tolerant Operation in LightStream 1010 ATM Switch Software Configuration 
Guide, Cisco Systems, 1997 

Netperf: A Network Performance Benchmark, Revision 2. Information Networks Division, 
Hewlett-Packard Company, 1995 

Liu, R.P., Richards, A., & Rogers, G. 1998, MARSHNet, A Superimposed QoS Guaranteed 
Virtual Path Network, to be presented at IEEE GLOBECOM’98, Sydney, November 1998

Stevens, W.R. 1994, TCP/IP Illustrated, Volume 1: The Protocols, Addison-Wesley 





Implementation of a User-Pays Library Printing System:
a Case Study


Kay Darbyshire
Dept. of Information Technology
Victoria University of Technology
Kay.Darbyshire@vut.edu.au


Dr. Richard Jacewicz
Dept. of Information Technology
Victoria University of Technology
Richard.Jacewicz@vut.edu.au


Abstract 

Traditionally, print services installed in a University environment were free of charge and were 
restricted to handling print requests from a particular network operating system. This presented 
many problems, including the waste of paper and duplication of resources. If a more open, 
user-pays system of printing could be designed, this would result in more efficient use of the 
available resources. 

This paper presents a case study of a design and implementation of such a user-pays scheme 
for a library printing service at Victoria University of Technology. With a limited budget, the 
goal was to design a print system that would allow students access to a laser printing service 
from a number of locations, and potentially using different network systems. 


Problem Description 

Victoria University of Technology offers its students a range of computer services, which 
frequently require document printing. These services include use of electronic CD-ROM 
databases, the Internet, as well as spreadsheeting and wordprocessing applications. Students 
have access to local Banyan Vines and TCP/IP networks via PCs located in laboratories and 
open access areas. As a result of funding constraints, only limited printing services were 
provided either via low quality printers attached to PCs or via usually better quality Banyan 
Vines printers located in some laboratories. 

The library has recently been required to implement a user-pays system for all student
printing in the library. The available commercial solutions either did not match the
library’s functional requirements or were too expensive in terms of total cost. The
alternative option, developing VUT’s own system, was perceived as a long process
posing significant costs and risk. This common perception has been revised in the
context of a growing number of functionally open modules and, to a lesser degree, open
hardware components. The design of a new system could therefore be reduced to the task of
combining available software and hardware components. It was expected that the
simplicity of the overall task would go hand-in-hand with the functionality and “openness” of
the joined components. This paper presents a case study of such an approach.





System Requirements 


Requirements for the new printing service included: 

• fast, high quality printing for LAN users (Banyan Vines); 

• charging users via user ID-card; 

• charging users per printed page;

• easy print job identification; 

• convenient, keyboard-free user interface (GUI) facilitating viewing of the print queue, 
printing and/or removing a job; 

• compactness (job selection, charging and printing all carried out at one place); 

• minimum maintenance (auto-recovery and automatic management of the print queue); 

• compatibility with the “user-pays” photocopying system (already existing within the
University); 

• reliable operation in open access areas; 

• quick implementation and low cost. 


Early Attempts 

The sought solution, named “Print Kiosk”, had to combine Banyan Vines LAN, laser printer, 
swipe card reader and the account database (a part of the existing photocopying system) into 
one printing system meeting all the specified requirements. Preliminary analysis showed that 
the printer required hardware modification to facilitate its cooperation with the swipe card 
reader (access to “page printed” and “paper out” lines). 

The first approach to the design of the new system was an adaptation of the setprint menu
offered natively by Banyan Vines. It was found to be vulnerable to malicious or accidental
misuse, inconvenient to navigate, and lacking a graphical interface. The limited options
available did not allow the setprint program to be customised to meet the requirements.

The second approach investigated the possibility of getting direct access to the Banyan Vines
print queue and managing the jobs. As the Banyan Vines server had been installed on the NT
platform, after relaxing permissions on the printer spooler directory, a program run on NT
could manage the contents of the queue and its index file jobs.lst. In this approach, the
graphical interface would be provided by the WWW service, while the executive routines
(listing jobs, selecting a job for printing, submitting the job to the printer) could be
implemented as cgi programs (C, Perl). It was soon discovered, however, that the index file
kept information about all jobs in the queue except the last one. It was also found that
direct operations on the queue remained “unnoticed” by the Banyan Vines print service and
that manipulation of the jobs index file was not easily accomplished.


Final Solution 

The restrictive environment of Banyan Vines was the main reason for moving towards a Unix
platform. Print jobs submitted to the Banyan Vines queue were diverted via an NT print
spooler to a remote network printer arranged on the Unix host. The Unix print queue was set
up to accept print requests; the printer associated with it, however, was permanently disabled.
This arrangement kept the spooled jobs for exclusive use by a managing program which, as a
cgi program, was executed in the WWW environment. Both the WWW server and browser
were installed on the same Unix host, so the WWW server had direct access to the print
queue and the Print Kiosk menu was displayed on the host console.

The concept of making the maximum use of functions offered by individual components has 
simplified the structure of the Print Kiosk system and significantly limited the programming 
task. The final configuration is shown in Figure 1. 






[Figure 1 diagram. Labels recoverable from the original: PC platform; Banyan Vines / NT server
(print job, print job number; spl086.tmp, jobs.lst; 00026.spl, 00026.shd); swipe card reader
with “paper out” and “page printed” lines; link to the Account Database.]

Figure 1 Print Kiosk structure

In the context of the proposed Print Kiosk structure, the life cycle of a print job consists of the 
following phases (together with the components responsible for their execution listed in 
brackets): 

1. Job generation (PC application); 


User workstations have been set up to use the print driver relevant to the printer used in 
the Print Kiosk system. The local printer set-up of all workstations must correspond to the 
characteristics of the target printer (e.g. selection of paper size). 

2. Job redirection to the Banyan Vines print queue (Banyan Vines print service);

Each workstation has its printer allocated to a Banyan Vines print queue. A generated
print job goes to the Banyan Vines server, where it is stored in the spooler directory
c:\Program Files\Banyan\Print\Spool. The contents of the job are stored in a file with
the generic name spl<job number>.tmp (e.g. spl086.tmp), while all administrative
information (job number, owner details, time received, size and status of the print job) is
appended to the index file jobs.lst.

3. Job number notification (Banyan Vines print service);

The Banyan Vines print spooler notifies the user what job number has been assigned to 
the print job. This is achieved by using a pop-up window on the workstation to display the 
information. The “job number” is the only identifier of the submitted job and will be used 
further on to select the job from the Print Kiosk console. 

4. Passing the job to the NT printer spooler (Banyan Vines print service, NT print
provider);

The Banyan print spooler hands the print job to a Windows NT printer and at this point
Windows NT assumes responsibility for the print job. The local print provider writes the
job to a Windows NT spooler (as a file with the extension .spl), while the administrative
information associated with the print job is written to a file with the extension .shd. Both
files are placed in the NT directory %systemroot%\System32\Spool\Printers [2].

5. Job redirection to a Unix print queue (NT print service, Unix print service);

The print provider sends the print job through the port monitor (using the Line Printer
Daemon Protocol [1]) to the remote queue on the Unix system. Afterwards any files
associated with the print job are deleted from the NT spooler directory. The Banyan print
spooler changes the status of the print job in jobs.lst to printed, and after a period of
time the administrative details of this job are removed from the Banyan Vines index file
[3].

The redirected print job comes to the Unix print queue and is stored in its spooler
directory (e.g. /var/spool/lp/tmp/itdss) as two files:
control file <request-ID>-0 and data file <request-ID>-1.

The allocated <request-ID> number is not the job number issued by the Banyan Vines 
print service. The original job number is, however, still preserved in the control file [1] 
and can be recovered. 


6. Listing contents of the Unix print queue (cgi program, WWW browser);

By default, the browser points to a cgi program, /cgi-bin/kiosk, which is responsible for
the user menu and the execution of requested operations as shown in Figure 2. Selecting
“Update job list” refreshes the screen with the current list of spooled jobs. The original
Banyan Vines job numbers are retrieved from the stored control files.






Figure 2 Print Kiosk menu 


7. Reading and validating user’s ID (firmware of the swipe card reader); 

Before printing, a user has to swipe his ID card through the card reader. The account 
database is interrogated and the available credit (in terms of number of pages) is displayed 
on the card reader. At any stage the user may add funds to the account via an Autoloader 
located close to the Print Kiosk. After determining that the account is in credit, the card 
reader resets the “paper out” signal, which normally changes the printer status to 
“Online”. 

8. Selection of a job (WWW browser); 

The Print Kiosk menu lists all jobs using the original job numbers. With standard radio 
buttons on this page, the user can highlight a job for printing. 

9. Sending the job for printing (cgi program);

The selected “job number” is mapped to a <request-ID> number and the corresponding
data file <request-ID>-1 is sent to a printer port (e.g. /dev/lp1).

10. Updating the status of the user’s account (firmware of the swipe card reader); 

Each printed page is accompanied by sending a signal to the card reader that updates the 
balance of the user’s account and the displayed credit. 


11. Deleting a job from the Unix print queue (cgi program; cron job);

A job may be removed from the spooler either by user request (after selecting a job and 
the “Delete” pushbutton) or by a built-in expiry policy of the cgi program. The policy 
currently being used permits print jobs to stay in the queue for at least 1 hour but not more 
than 2 hours. 
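
Phases 9 and 11 above can be illustrated with a short sketch. It is written in Python purely for illustration (the original kiosk program is a Bourne shell/awk script), and the `SpooledJob` record, the function names and the assumption that cleanup runs once per hour are all hypothetical. With an hourly pass that removes every job at least one hour old, a job stays in the queue for at least 1 hour and is gone before it is 2 hours old, which matches the stated policy.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_AGE_SECONDS = 3600  # jobs are kept for at least 1 hour (from the text)


@dataclass
class SpooledJob:
    job_number: int    # original Banyan Vines job number shown to the user
    request_id: int    # Unix <request-ID> assigned by the Unix print service
    spooled_at: float  # arrival time, in seconds since the epoch


def find_request_id(jobs: List[SpooledJob], job_number: int) -> Optional[int]:
    """Map a Banyan Vines job number to the Unix request ID (phase 9)."""
    for job in jobs:
        if job.job_number == job_number:
            return job.request_id
    return None


def purge_expired(jobs: List[SpooledJob], now: float,
                  min_age: float = MIN_AGE_SECONDS) -> List[SpooledJob]:
    """Return the jobs that survive one cleanup pass (phase 11).

    Assumption: this runs once per hour, so a job is removed at the first
    pass after it turns min_age old, i.e. between 1 and 2 hours after arrival.
    """
    return [job for job in jobs if now - job.spooled_at < min_age]


if __name__ == "__main__":
    queue = [SpooledJob(86, 26, 0.0), SpooledJob(87, 27, 3000.0)]
    print(find_request_id(queue, 86))         # request ID of job 86
    print(purge_expired(queue, now=3600.0))   # only the younger job remains
```

In a real deployment the job list would be built from the files in the spooler directory rather than from in-memory records.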


Technical Details 

The graphical user interface is based on the Netscape browser run as a single window
application, without a window manager. The boot-up sequence of the Unix host starts up the
WWW server (/etc/rc2.d/S89web), initialises the X-window environment
(/etc/rc3.d/S90kiosk) and launches the WWW browser as a single X-window application
(~kiosk/.xinitrc). In the latter file, all lines invoking an X-window manager should be replaced
with a line starting up a WWW browser, e.g.

/opt/net304/netscape -geometry 800x600-0+0 http://kiosk.vut.edu.au/cgi-bin/kiosk 

The size declared in the “geometry” option should match the defined size of the screen. The
various toolbars and information fields have been hidden by setting options in
the browser’s preferences. After these changes, the menu displayed on the Print Kiosk
screen does not show any controls or graphical elements of the Netscape browser, as Figure 2
indicates. With a mouse as the only input device exposed to the public, it is
impossible to move or resize the window, to open a new window or to escape to any other
program or utility.

The managing program of the Print Kiosk (/cgi-bin/kiosk) generates the menu with the current
numbers of spooled jobs displayed and offers 4 operations: “Update the job list”, “Print the
job”, “Delete the job” and “Print and delete the job” (which is a shortcut for the sequence of
the two previous operations). The menu is an HTML document structured as a form, which again
invokes /cgi-bin/kiosk after one of the four available submit pushbuttons is selected.
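
The self-invoking form described above might look like the following sketch. It is in Python for illustration only (the original is a Bourne shell/awk script); the function name and the form field names `job` and `op` are hypothetical, while the four operation labels and the /cgi-bin/kiosk action come from the text.

```python
from html import escape
from typing import Iterable

# The four operations named in the text, one submit button each.
OPERATIONS = [
    "Update the job list",
    "Print the job",
    "Delete the job",
    "Print and delete the job",
]


def render_menu(job_numbers: Iterable[int], action: str = "/cgi-bin/kiosk") -> str:
    """Build the kiosk menu as an HTML form that posts back to the same program."""
    # One radio button per spooled job, labelled with its original job number.
    rows = [
        f'<input type="radio" name="job" value="{n}"> Job {n}<br>'
        for n in job_numbers
    ]
    # One submit button per operation; the program can tell them apart by value.
    buttons = [
        f'<input type="submit" name="op" value="{escape(op, quote=True)}">'
        for op in OPERATIONS
    ]
    opening = f'<form method="post" action="{escape(action, quote=True)}">'
    return "\n".join([opening] + rows + buttons + ["</form>"])


if __name__ == "__main__":
    print(render_menu([86, 87]))
```

Because every button submits to the same URL, a single program can both draw the menu and carry out the selected operation, which is consistent with the small script size reported in the paper.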

The solution presented could be implemented in a variety of environments. The VUT 
implementation of Print Kiosk is based on the following software/hardware components: 

• PC workstations with Windows95 and Banyan Vines Network Client; 

• Microsoft Windows NT 4.0 Server with Banyan StreetTalk for Windows NT Version 8.5;

• Xerox DocuPrint 4517 printer with PCL5 interpreter; 

• Unix (SunOS 5.5.1 i86pc) running Apache 1.2.4 WWW server and Netscape Navigator 
3.04; 

• Monitor CTM swipe card reader and the account database Monitor On-line System. 


Conclusions 

The concept of developing a Print Kiosk service on the basis of existing components proved
to be successful. Functional compatibility of these components limited the programming
work to about 120 lines of Bourne shell/awk script (/cgi-bin/kiosk). The solution is open to
the incorporation of further functional extensions like pre-print page counting, document
preview or accepting print jobs from Macintosh, NT and Unix environments.

The Print Kiosk solution has now been installed at six university campuses and works 
reliably. Student printing has increased significantly to the extent that revenue generated from 
the service offsets the infrastructure and maintenance costs. 


References 

[1] McLaughlin III, L. Line Printer Daemon Protocol [RFC1179], The Wollongong Group,
August 1990

[2] Banyan Systems Incorporation, StreetTalk for Windows NT: Internals and Performance
Analysis, Westboro, Massachusetts, 1997

[3] Microsoft, Resource Guide, Microsoft Press, Redmond, Washington, 1996 





An architecture for remote network management 
using the RMON MIB and programmable agents 

Bradley Williamson + and Craig Farreir + 

+ Australian Telecommunications Research Institute (ATRI) 

+ CRC for Broadband Telecommunications and Networking (CRC-BTN) 

* School of Computing 
Curtin University of Technology 
GPO BOX U 1987, Perth, Western Australia 6845 
Email: brad@atri.curtin.edu.au, craig@cs.curtin.edu.au 
Telephone: (08) 9266 3243, Facsimile: (08) 9266 3244 


Abstract 

Recent years have seen an increase in the size, complexity, number and types of networks
and associated protocols. This increase emphasises the importance of network
management. Using various network management techniques, the Network Manager is
able to maximise network efficiency, so as to improve productivity. Remote monitoring is
one of many network management techniques; it involves the collection and collation
of data on an agent without intervention from the network management station (NMS).

The Remote Monitoring Management Information Base (RMON MIB) is an implementation
database for remote monitoring. The RMON MIB defines how remote monitoring
(RMON) data is stored for the management station. RMON data is transferred between
the agent and the NMS using the Simple Network Management Protocol (SNMP).
Although the RMON MIB is useful for remote monitoring, many shortcomings
exist which prevent the agent from realising its full functionality.

This paper proposes a technique to overcome these limitations. A programmable agent
is attached to the RMON MIB. The agent allows the network manager to dynamically
assign tasks and download programs to an agent. We demonstrate that this solution provides
a flexible and versatile foundation for distributed management. Furthermore,
this paper presents some abstractions to remote network management which use the
programmable agent.


1 Introduction 

The proliferation of networks and associated networking protocols has reinforced the importance
of network management. Network management has been defined as the process of
monitoring a network in order to achieve the maximum efficiency and productivity [7].
Furthermore, as networks grow, new and more powerful, versatile and flexible network management
techniques are required.





An implementation for distributed management is the Remote Monitoring Management Information
Base (RMON MIB) [21, 22]. Standardised by the Internet Engineering Task Force
(IETF) remote monitoring working group, the RMON MIB defines how remote monitoring
statistics are stored.

The RMON MIB moves away from traditional network management, as the collection
and processing of network management data is split between two entities. The probe collects
the raw network data, which is available for the network management station (NMS) to retrieve
and process. A side effect is that a probe can be accessed by multiple NMSs (and vice
versa), thus providing distributed management.

Version 1 of the RMON MIB is divided into nine functional groups [21] which generate network
statistics at the data link layer. The Ethernet Statistics, History, Alarms, Host, hostTopN,
Matrix, Filter, Packet Capture and Event groups provide basic functions for remote monitoring;
however, some groups require the facilities provided by other groups. Recent work
has seen the development of the RMON MIB version 2 [22], which defines network statistics
specifically for the network and “application” layers.

Transferring network statistics between the NMS and the RMON probe uses the Simple Network
Management Protocol (SNMP) [1]. SNMP has been adopted by the Internet community as a de facto
standard for transferring network management data. Managed objects within
the RMON MIB are encoded using the Structure for the Management of Information (SMI) 1
[15], meaning that OSI protocols such as CMIP/CMOT can also be used.

As documented by Waldbusser [21], the RMON MIB attempted to achieve the following objectives:
offline operation, proactive monitoring, value added data, problem detection and reporting.
Preliminary investigations (discussed in Section 2) into remote monitoring (RMON)
agents which use the RMON MIB indicate that the standard does not always meet these
expectations.

This paper presents an alternative solution to remote network management which reduces
these problems. A programmable agent inherits the functionality of an existing RMON agent
and adds a programmable management information base (MIB) and a virtual machine.
The programmable MIB stores a program and associated control information. The
virtual machine executes the program which is stored in the MIB. The ability to dynamically
add functionality provides a solution for better management because:

• network overhead is reduced, 

• intelligence is distributed, and 

• an increased level of fault tolerance is available. 

Consider the following example: a network manager wishes to know which hosts are generating
the most traffic when the network load exceeds a threshold. A conventional RMON agent
would require the NMS to reactivate and continually monitor an Ethernet 2 statistics group.
Once the NMS retrieves the statistics, the network load is calculated. If the threshold has been
exceeded then the NMS starts a hostTopN group which collects host statistics for later retrieval.
The NMS would repeat this process whenever the network load exceeded the threshold.

1 Subset of Abstract Syntax Notation One (ASN.1)

2 Ethernet is a trademark of Xerox Corporation






Using a programmable agent, a customised network management application could be written
and downloaded to the agent, to be triggered whenever the network load exceeds the
threshold. The program gathers statistics generated by the hostTopN group and produces a
summary table by combining the current hostTopN with any previously captured hostTopNs.
In this way, a summary table can be produced for the monitoring period during which hosts
were communicating while the network load was high. The summary table is stored in the
local MIB for later retrieval by the NMS.
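
The summary-table idea can be sketched as follows. This is an illustration in Python rather than the interpreted language used by the authors, and it is not their program: the function names, the representation of a hostTopN report as a mapping from host address to packet count, and the way the load and threshold are passed in are all assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple


def record_if_busy(summary: Counter, load: float, threshold: float,
                   report: Dict[str, int]) -> bool:
    """Fold one hostTopN report into the summary when the load exceeds the threshold.

    `report` maps a host address to its packet count for the sampling period.
    Returns True if the report was recorded.
    """
    if load <= threshold:
        return False
    summary.update(report)  # adds each host's count to its running total
    return True


def top_hosts(summary: Counter, n: int) -> List[Tuple[str, int]]:
    """Return the n busiest hosts accumulated over all recorded high-load periods."""
    return summary.most_common(n)


if __name__ == "__main__":
    table: Counter = Counter()
    record_if_busy(table, load=0.9, threshold=0.7, report={"hostA": 10, "hostB": 5})
    record_if_busy(table, load=0.4, threshold=0.7, report={"hostC": 99})  # ignored: load is low
    record_if_busy(table, load=0.8, threshold=0.7, report={"hostB": 20, "hostC": 1})
    print(top_hosts(table, 2))
```

Only the accumulated table needs to cross the network when the NMS eventually asks for it, rather than every individual report.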

What the programmable agent provides is a mechanism for the network manager to construct
a custom application within the RMON framework. Network managers may write programs in
an interpreted language 3 which are stored and distributed like any other piece of RMON data.
The only difference is that the data is a program which may be executed later. This removes
the restriction which exists in current network management systems, where the functionality
and the structure of network management information is hard wired.

The programs may be run manually by the NMS or automatically via the RMON alarm group.
Programs may be any general purpose program written in an interpreted language. The program
may access other MIB variables (either locally or remotely) via SNMP. Results from the
program are also produced in this way.

By allowing general purpose programs to be run from RMON, value added data can be calculated
and stored in local MIBs for later retrieval. Fault tolerance is also achieved since this
mechanism will continue to run even when faults occur. This allows offline operation as no 
intervention from the NMS is required to activate a program. Scalability is increased since the 
NMS is not required to keep state information for the program. The potential network traffic 
bottleneck to the NMS is also alleviated since the value added data is collected, calculated 
(produced) and stored on the remote probe. Under the existing RMON MIB, potentially large 
amounts of raw data need to be transferred to the NMS in order to produce desired statistics or 
value added data. 

This represents a paradigm shift away from the centralised approach to network management, 
dictated by the previous RMON standards. This paper discusses how a programmable agent 
was implemented and how it was subsequently used to overcome some of the shortcomings 
experienced by the current RMON MIB. 


2 Background 

Since 1973, the Defense Advanced Research Projects Agency (DARPA) has been working
on a suite of protocols used for the transfer of data across the Internet [2]. This suite of
protocols, which includes the Internet Protocol (IP), was designed with the philosophy that
multiple services can be provided and distributed across a heterogeneous environment 4 . Clark
[2] looks at the philosophy behind the DARPA suite of protocols. He concludes that these
protocols have been successful in academic, commercial and military environments but
lack facilities for resource management. More importantly, as these protocols are datagram
based [3], determining network characteristics from datagram traffic is difficult. For this
reason, Network Management is imperative.

Network Management, as defined by Leinwand and Fang [7] is the process of controlling a 

3 For this document, we shall use perl [23] as the interpreted language. 

4 Heterogeneity includes both host and network infrastructures





complex data network so as to maximise its efficiency and productivity. Using various network
management techniques, the Network Manager is able to detect changes in network
performance, some of which are influenced by fluctuations in idle bandwidth, the number of
machines, etc.

Traditional network management has involved gathering network statistics passively [13].
Passive monitoring involves the NMS promiscuously 5 capturing packets [12] and leaving the
network in its normal state. However, as networks become more complex, traditional network
management techniques are becoming unsatisfactory [13]. Rabie [13] and Deri [4] both argue
that as large scale networks become more prevalent, the complexity and heterogeneity of
networks are both increasing. Modern networks tend to be segmented into smaller local area
networks (LANs) for better management; hence, passive monitoring is no longer suitable, as
it can only watch a single segment. For this reason, distributed management
techniques are required.

Distributed management is where the network management system is spread across entities
of a network so that the entire network can be watched; however, there is no central control.
Agents deployed across the network collect raw network management data, which is
then gathered by the NMS for the Network Manager to view and make judgements.

A pragmatic approach to distributed management is the RMON MIB. The RMON MIB [19, 
21, 22], standardised by the Internet Engineering Task Force (IETF) Remote Monitoring 
(RMON) working group, defines how remote monitoring statistics are stored. By defining 
a structure for remote network management information, proactive network monitoring of a 
remote segment can take place [16, 17]. This information is available for transfer between the 
RMON agent and the network management application using SNMP. Both the RMON agent 
and NMS need not be on the same segment. 

Traditional network management has involved evaluating network data gathered from a single 
segment. RMON can remove these limitations. Waldbusser [21] outlines five objectives of the 
RMON MIB which aid in removing these limitations. 

1. Value Added Data. As the probe is collecting and generating network statistics, value 
added data can be generated. Value added data, as defined by Waldbusser [21], is the 
type of network management data required to precisely solve a problem. 

2. Offline operation. Offline operation is provided where network management data can 
be collected without the intervention of the NMS. This is useful when network failures 
occur. Value added data is stored in the MIB which can be later collected by the NMS. 

3. Proactive monitoring. Network diagnostics should be continually run, and network performance
should be monitored and logged, regardless of network performance [21, 17].
When a network fault occurs, the NMS should retrieve the data and diagnose the fault 
based on past performance. 

4. Problem detection and reporting. An RMON probe can provide basic fault management. 
If an error occurs, the NMS can be notified and/or the conditions causing the event are 
logged. 

5. Multiple Managers. The RMON framework uses a client-server relationship with the
NMS, where the RMON probe is the server. This allows multiple NMSs to communicate

5 Promiscuous mode refers to the host receiving all packets regardless of the destination address. 




with the probe which may result in the concurrent use of resources. A side effect of this 
relationship is that the client (NMS) can access many probes. This means that the NMS 
can combine network management data to determine the overall state of the network. 

The RMON MIB, which is a subtree of MIB-II [10], is divided into nine functional groups: 

• Subgroup 1 - Ethernet Statistics 

• Subgroup 2 - Ethernet History 

• Subgroup 3 - Alarms 

• Subgroup 4 - Host statistics 

• Subgroup 5 - hostTopN statistics 

• Subgroup 6 - Matrix 

• Subgroup 7 - Packet filter 

• Subgroup 8 - Packet capture 

• Subgroup 9 - Event 

Each group provides a basic function of remote network monitoring, however, some groups 
require the facilities provided by other groups. For example: hostTopN requires the facilities 
of the host group. 

As well as defining a database for remote network statistics, the RMON MIB includes a procedure
for initialising and controlling the retrieval of network statistics. Each probe consists of
a set of probe entities which are activated and deactivated by the NMS using SNMP. To avoid
any concurrency problems and conflicts, a multi-packet exchange of SNMP packets dubbed
the RMON-polka is required [16, 17, 6].
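The exchange can be pictured with a small simulation. The sketch below is illustrative only: it assumes the exchange follows the EntryStatus row-creation convention of the RMON MIB (createRequest, underCreation, valid, invalid), it is written in Python rather than against a real SNMP stack, and the class and method names (ControlTable, set_status, set_param) are hypothetical.

```python
from enum import IntEnum


class EntryStatus(IntEnum):
    """EntryStatus values defined by the RMON MIB textual convention."""
    VALID = 1
    CREATE_REQUEST = 2
    UNDER_CREATION = 3
    INVALID = 4


class ControlTable:
    """Toy model of one RMON control table held by a probe (no real SNMP involved)."""

    def __init__(self):
        self.rows = {}  # row index -> {"status", "owner", "params"}

    def set_status(self, index, status, owner=None):
        """Emulate an SNMP set on the row's status column; return True on success."""
        row = self.rows.get(index)
        if status == EntryStatus.CREATE_REQUEST:
            if row is not None:
                return False  # another manager already claimed this index
            self.rows[index] = {"status": EntryStatus.UNDER_CREATION,
                                "owner": owner, "params": {}}
            return True
        if row is None:
            return False
        if status == EntryStatus.VALID:
            row["status"] = EntryStatus.VALID  # probe entity becomes active
            return True
        if status == EntryStatus.INVALID:
            del self.rows[index]  # probe entity is deactivated and removed
            return True
        return False

    def set_param(self, index, name, value):
        """Configure a row; in this simplified model only while it is under creation."""
        row = self.rows.get(index)
        if row is None or row["status"] != EntryStatus.UNDER_CREATION:
            return False
        row["params"][name] = value
        return True


table = ControlTable()
first = table.set_status(1, EntryStatus.CREATE_REQUEST, owner="nms-a")
second = table.set_status(1, EntryStatus.CREATE_REQUEST, owner="nms-b")  # conflict
table.set_param(1, "interval", 30)
table.set_status(1, EntryStatus.VALID)
print(first, second, table.rows[1]["status"].name)
```

The point of the multi-step exchange is visible in the second request: a manager that asks for an index which is already claimed is refused, so two NMSs cannot configure the same probe entity at once.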

Although the RMON MIB is a useful framework for remote network management, it has
shortcomings which prevent the NMS from achieving the maximum benefits. Measured against
the objectives set out by Waldbusser [21], the RMON MIB exhibits the following shortcomings:
limited value added data, limited offline operation and reduced available network bandwidth.

As mentioned previously, value added data is data which has significant value and can give
the Network Manager precisely the information needed to solve a class of problems [21]. For
example, the hostTopN group can generate a list of the 10 hosts that sent the most packets.
Although it may be possible to add value to the data locally to produce some custom view of
RMON data, with only the value added data being returned to the NMS, this is seldom done.
Instead, the network manager has to retrieve the statistics and then process
them on the NMS to provide a specific view of the data.
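As an illustration of what value added data means in practice, the following Python sketch reduces a raw per-host counter table to a short ranked list, in the spirit of hostTopN. The counter values are invented for the example and the function name top_n_hosts is hypothetical; it is not part of the RMON MIB.

```python
import heapq


def top_n_hosts(out_packets, n=10):
    """Return the n (host, packets) pairs with the highest packet counts."""
    return heapq.nlargest(n, out_packets.items(), key=lambda item: item[1])


# Hypothetical raw host-table counters: hardware address -> packets sent.
raw_counters = {
    "08:00:69:07:6e:cb": 9120,
    "08:00:69:08:e0:a8": 7433,
    "00:00:0c:01:8a:db": 5310,
    "08:00:20:79:41:99": 1204,
    "08:00:20:06:ea:2a": 87,
}
report = top_n_hosts(raw_counters, n=3)
print(report)
```

If this reduction runs on the probe, only the short report crosses the network; if it runs on the NMS, the whole counter table has to be transferred first.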

The RMON framework offers only limited offline operation. Ideally, network management
data should consume little bandwidth as normal services should not be degraded. Solving a
network management problem which requires value added data to be calculated on the NMS
assumes a reliable connection. In practice, however, if a fault occurs and part of the network
becomes isolated from the management platform, existing RMON probes operate with
reduced functionality.

Originally, it was thought that wide area networks (WANs) would benefit from RMON. However,
if a large number of probes are deployed and the NMS continually downloads raw statistics
from the agent then a network traffic bottleneck may occur at the NMS.

Implementations of RMON show that it is restricted by the underlying network management 
transport protocol substrate (SNMP). Two problems experienced with SNMP and RMON are: 

1. Dropped Packets. Dropped packets result from the combination of a busy RMON
probe and an unreliable transport protocol (UDP/IP). When the network is heavily loaded,
the probe may have to take processing resources away from the SNMP agent in order
to process packets and maintain reliable statistics. The RMON probe therefore
appears to simply ignore (or drop) the SNMP requests for packet capture. Paradoxically,
the time when the network is heavily loaded is often also the time when RMON data is most
required.

2. Minimum-Maximum PDU size. SNMP has a minimum-maximum PDU size of 484
bytes, that is, an implementation is only required to accept messages of up to 484 bytes.
Raw network data captured with RMON’s remote packet capture which is larger than this
therefore needs to be fragmented. To effectively use this data, the
NMS has to download the fragmented packets and then reconstruct them. Furthermore,
the finite time difference in capturing and fragmenting PDUs means that packets may be
misaligned.
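The extra work this places on the NMS can be sketched as follows. The Python fragment below is a simplification: it treats the whole 484 bytes as payload (a real SNMP message also carries header and encoding overhead, so the usable payload is smaller), and the function names fragment and reassemble are hypothetical.

```python
MAX_FRAGMENT = 484  # assumed payload budget per SNMP response (overhead ignored)


def fragment(packet: bytes, size: int = MAX_FRAGMENT):
    """Split a captured packet into (offset, chunk) pairs of at most `size` bytes."""
    return [(offset, packet[offset:offset + size])
            for offset in range(0, len(packet), size)]


def reassemble(fragments):
    """Rebuild the packet on the NMS, tolerating out-of-order arrival."""
    return b"".join(chunk for _, chunk in sorted(fragments))


# A maximum-size Ethernet frame (1518 bytes) needs four requests to retrieve.
captured = bytes(i % 256 for i in range(1518))
pieces = fragment(captured)
print(len(pieces), [len(chunk) for _, chunk in pieces])
restored = reassemble(reversed(pieces))
```

Every fragment is a separate request and response over UDP, so the loss of any one of them forces the NMS to retry before the packet can be reconstructed.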

The analysis of the RMON MIB has shown that the standard provides a useful mechanism for
remote monitoring. However, many shortcomings exist which prevent the NMS from achieving
the maximum benefits. Some of these shortcomings are caused by SNMP; most, however,
contradict the original objectives of the RMON MIB.

Overall, these problems indicate that the RMON MIB lacks scalability. If the goal of offline
operation were better achieved then this problem would not have to be addressed. For useful
network management data to be calculated, a high level of interaction is required between the
RMON probe and the NMS.

Furthermore, the RMON MIB lacks fault tolerance. Problems which occur as a result of the 
probe not operating offline suggest that these groups will not function effectively when a fault 
occurs. Ideally, the Network Manager requires network statistics when a fault occurs or when 
a fault is likely to occur. Unfortunately, if the probe cannot function during this period, then 
the network management data gathered is irrelevant. 

In the next section, we discuss a new solution to remote network management which reduces 
the shortcomings of the RMON MIB and better implements the original goals of RMON. 


3 The Programmable Agent 

An RMON probe services SNMP requests and interacts with the MIB. Variables which change
state affect the operation of the RMON probe. Probes are activated (and deactivated) depending
on the values stored in the MIB variables. We can see that the existing framework for
remote monitoring using the RMON MIB is not programmable. The NMS is constrained by a
fixed set of tasks offered by the RMON MIB. To produce value added data, raw statistics need
to be transferred to, and processed by, the NMS. In many circumstances, this increases the
network overhead and reduces scalability and fault tolerance. Furthermore, the existing RMON
framework moves back to the centralised approach where all the processing is carried out at
the NMS.

A solution to this problem is the programmable RMON agent. As mentioned previously, the
programmable agent inherits the functionality of the existing agent but adds a control-logic
component. The control-logic component not only allows network management
applications to be remotely executed, but also allows functionality to be transferred to the
agent. This component is a virtual machine which executes a program stored in the MIB. The
advantages of using a programmable agent are:

1. Active Programming is available. The network manager is no longer constrained to the
limited functionality offered by the hard-wired MIB. Extra functionality can be dynamically
added by uploading an application program. The program is written by the network
manager. Furthermore, as demonstrated by remote monitoring abstractions (in the next
section), a program can be locally downloaded (from the MIB), modified and replicated
to another agent.

2. Network overhead is reduced. The data generated by the RMON MIB can be post-processed
by the remote agent itself to generate value added data. In most circumstances, the value added
data is more compact than the raw data, thus reducing the amount of data which needs to be
returned to the NMS. This will be shown by looking at some abstractions in Section 4.

3. Intelligence is shifted. The programmable agent provides a mechanism for true distributed
management. This means that the real processing of network management data
does not have to take place on the NMS, so scalability is increased.

4. Localised fault management. Local processing of data generated by the RMON MIB 
enhances flexibility when dealing with faults. Programs can be written so MIB variables 
can be monitored offline. If a fault occurs then corrective action can be taken. The level 
of fault tolerance is increased as the NMS is relieved of the need to process raw data, 
using value added data instead. 

3.1 Design 

3.1.1 Layout 

Figure 1 shows how the programmable component fits into the current agent’s architecture 
(boxed). The programmable MIB is a collection of managed objects concerning a program. 
The virtual machine contains the control logic which executes a program. 

We can see from this diagram that the virtual machine is loosely coupled to the RMON MIB. 
This is done for two reasons. Firstly, if the network management transport protocol is changed 
(e.g. CMIP is used instead of SNMP), the MIB and virtual machine do not have to be modified. 
Secondly, as discussed further, the RMON MIB remains location transparent. 

To use the programmable agent, a program is uploaded (using SNMP) into the MIB. Once
uploaded, the program is executed by modifying a MIB variable, at which point the code is passed
into the virtual machine. During execution, results can be stored in the MIB for retrieval by
the NMS.

Figure 1: Agent Layout

Communication between the MIB and the virtual machine is one-way. To query the MIB, a local
SNMP request is required. A side effect of this is that the MIB does not know the source of
the query (this is hidden by the SNMP agent), meaning that we get location transparency at no
cost. Abstractions which are presented in Section 4 highlight this feature.


3.1.2 Programmable MIB 

The programmable MIB stores information about each program which is to be executed on the 
virtual machine. Using a network management transport protocol, such as SNMP, requests are 
received by the agent and the MIB variables are updated accordingly. All modifications to the 
MIB must take place via the agent using SNMP. 

The programmable MIB consists of three groups: Control Group, Program Group and Data 
Group. 

Control Group 

The control group controls execution of each program. As shown in Figure 1, interaction with 
the virtual machine only takes place via the agent’s MIB. Information that will be stored in 
this group includes: Control Status, Program Owner and Trap Host Details. 

Modifying these fields takes place via an SNMP request. These requests can be generated by 
the NMS or a program which may exist on another host. Overall, the MIB is unaware of the 
source of the request meaning that location transparency is automatically achieved. 

Data Group 

The data group stores variables which the program wants to make available to the MIB. This 
can be used for debugging purposes or for intermediate or final output from the program. The 
data table is a two-dimensional array made up of a sequence of variable names and associated 
values. The purpose of the variable name is to inform the calling source of the data’s reference. 

By allowing the program to define how data is output, it is possible for existing MIBs to be 
abstracted. Notwithstanding that this work uses the RMON MIB as a foundation, customised 
applications can be developed which can reproduce the MIB. 

Program Group 

The program group stores the code which is to be executed by the virtual machine. No
syntax or semantics are defined for this data, as it is assumed that error checking is carried out
by the virtual machine.

[Figure 2 (screenshot): RMON statistics window for one Ethernet interface, listing total and peak values with timestamps for bytes, packets and drop events, packet types (multicast, broadcast, unicast), problems (CRC/alignment, fragments, jabbers, collisions, undersized, oversized) and frame size ranges from 64 to 1518 bytes.]

Figure 2: RMON GUI Client

When a program is executed, the virtual machine is passed a copy of the code stored in this 
group. 
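The interplay of the three groups can be sketched in a few lines. The following Python model is a simplification under several assumptions: the class, variable and status names (ProgrammableMIB, snmp_set, "run", "finished", publish) are hypothetical, Python's exec stands in for the virtual machine, and no real SNMP encoding is involved.

```python
class ProgrammableMIB:
    """Toy programmable MIB with control, program and data groups."""

    def __init__(self):
        self.control = {"status": "idle", "owner": None, "trap_host": None}
        self.program = ""  # program group: source text, stored uninterpreted
        self.data = []     # data group: (variable name, value) pairs


class VirtualMachine:
    """Executes a copy of the stored program; error checking happens here."""

    def run(self, mib):
        source = mib.program  # the VM is handed a copy of the code

        def publish(name, value):
            mib.data.append((name, value))  # make a result visible in the MIB

        try:
            exec(source, {"publish": publish})
            mib.control["status"] = "finished"
        except Exception as exc:  # includes syntax errors in the uploaded text
            mib.control["status"] = "error"
            publish("error", repr(exc))


class ProgrammableAgent:
    """All access to the MIB goes through the agent's set and get operations."""

    def __init__(self):
        self.mib = ProgrammableMIB()
        self.vm = VirtualMachine()

    def snmp_set(self, variable, value):
        if variable == "program":
            self.mib.program = value
        elif variable in self.mib.control:
            self.mib.control[variable] = value
            if variable == "status" and value == "run":
                self.vm.run(self.mib)  # modifying the variable starts execution
        else:
            raise KeyError(variable)

    def snmp_get(self, variable):
        if variable == "data":
            return list(self.mib.data)
        if variable == "program":
            return self.mib.program
        return self.mib.control[variable]


agent = ProgrammableAgent()
agent.snmp_set("owner", "nms")
agent.snmp_set("program", "publish('total', sum(range(5)))")
agent.snmp_set("status", "run")
print(agent.snmp_get("status"), agent.snmp_get("data"))
```

Because the agent cannot tell whether a set request came from the NMS or from a program on another host, the same three calls also serve for agent-to-agent uploads.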


3.2 Implementation 

Implementation of the programmable RMON agent took place in two phases. The first phase
was to build a baseline agent which conformed to the RMON MIB standard. To show conformance,
an interoperability test was carried out using the Spectrum Element Manager (see footnote 6).
Spectrum provides an RMON client using a graphical user interface for displaying raw RMON
data. Figure 2 shows a snapshot of the RMON ethernet statistics group which collects basic
network information.

The second phase was integrating the programmable component. This was achieved by including
the virtual machine component, which was the interpreted language Perl [23]. Furthermore,
a series of managed objects (the programmable MIB) was integrated into the existing
MIB, allowing programs to be uploaded.

The overall result is a software agent (rmond - Remote MONitoring Daemon) which is deployed
throughout the network. To access the agent, a series of network management applications
(command line based) were developed. Furthermore, rmond is complemented by a series
of applications which implement the abstractions presented in Section 4.


4 Programmable agent abstractions 

This section discusses some applications of the programmable agent. The number of possible
applications is effectively unlimited, as the Network Manager can now specify the agent’s
functionality. The applications presented in this section demonstrate how the programmable
agent can be used to overcome many limitations experienced with previous RMON standards.

6 Developed by Cabletron Systems.


4.1 Migratory Agent 

Suppose the Network Manager wants to collate statistics based on network management data
from multiple RMON probes. If cumulative statistics are to be calculated then the NMS needs
to connect with each RMON probe and collect raw data. We can clearly see that this approach
is cumbersome because the network bandwidth consumed will bottleneck at the NMS. This
problem stems from the centralised approach which is evident in existing RMON implementations.
Furthermore, it is assumed no faults will occur. As mentioned previously, if a segment
becomes isolated then vital data is lost.

The problem with assuming a reliable link is that Network Managers require significant network
data when the network is busy or when a fault occurs. Furthermore, the volume of
network management data must not hinder the performance of the networked environment.

An application of the programmable agent which solves these problems is the migratory application.
A migratory application is an application that can migrate from one host to another,
maintaining the state of the user interface. In the context of network management, a migratory
application can be used to traverse the network, calculating value added data at each hop and
then migrating to the next host for further processing. When the program has reached its
conclusion, the final result will be available.

The migration is achieved by the following algorithm: 

1. Download a copy of itself. As the program is stored in the database as ASCII text, a 
copy of itself is stored in an array. 

2. Modify code. The code is modified depending on the network management data being 
calculated. 

3. Copy the code to the next agent. This is achieved the same way as the NMS uploading 
and activating a program. 

4. Execute code on the next agent. 

To keep the programmable agent “efficient”, subsequent executions remove the previous agent’s
code. However, the end point of the migration will leave the results so that they can be collected
by the NMS.
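The migration steps above can be sketched as a small simulation. This is not the rmond implementation: the program is modelled as a Python dictionary rather than as modifiable source text, agents are plain in-memory objects, and the names HopAgent and run_migration, together with the packet counts, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HopAgent:
    name: str
    local_packets: int              # raw statistic held in this agent's RMON MIB
    program: Optional[dict] = None  # stands in for the agent's program group
    data: dict = field(default_factory=dict)  # stands in for the data group


def run_migration(agents, route):
    """Carry a running total along `route`, leaving the result on the last agent."""
    if not route:
        raise ValueError("route must name at least one agent")
    program = {"remaining": list(route), "total": 0}
    previous = None
    while program["remaining"]:
        here = agents[program["remaining"][0]]
        here.program = program  # steps 3-4: upload to this agent and execute
        # Steps 1-2: take a copy of the running program and modify it so that
        # it carries the new partial result and the rest of the route.
        program = {"remaining": program["remaining"][1:],
                   "total": program["total"] + here.local_packets}
        if previous is not None:
            previous.program = None  # remove the code from the agent just left
        previous = here
    previous.data["total_packets"] = program["total"]
    return previous


agents = {name: HopAgent(name, packets)
          for name, packets in [("A1", 100), ("A2", 250), ("A3", 50)]}
last = run_migration(agents, ["A1", "A2", "A3"])
print(last.name, last.data)
```

Only the end point retains a program and a result, so the NMS makes a single retrieval regardless of how many agents were visited.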

By developing an alternative to remote network management, the migratory agent achieves the 
following: 

1. Intelligence is shifted. This program demonstrates how intelligence is shifted from the
NMS. Once the program is active, no intervention is required from the NMS. This reinforces
the distributed management paradigm and value added data can be computed.

2. Fault tolerance is increased. If a segment becomes isolated, then a migratory application 
can remain stationary until the network becomes stable. The example can be extended to 
allow for the collection of further data if such a problem occurs. Increased fault tolerance 
reinforces the original RMON objective of offline operation. 






Figure 3: Proxy Graph 


4.2 Proxy Agent 

Consider the following example. You have a bank with a central point of control and a set of 
branches. Branches from each state connect via a common link (one from each state) to the 
central node. Using existing RMON agents would require the central node to poll each agent 
to retrieve vital network statistics. 

If the network is busy (in terms of the amount of normal services) and the common links are
hired from a communications carrier then this can be very costly in terms of network time
and monetary cost. Ideally, agents need to be deployed in each state and sub agents at each
branch. These agents collaborate with existing RMON agents; however, they extract common
data which is sent back to the central node. This method will reduce network traffic to
the central node and reduce cost. Furthermore, normal network services will not be degraded.

The proxy agent is another method for distributed remote network management. Unlike the
migratory application, the proxy agent collects data on behalf of other agents. Given a network
topology, the proxy program collects data from other agents on behalf of a user agent or the
NMS. If implemented correctly, the proxy agent is location transparent.

Figure 3 shows a simple diagram of a network represented as a directed graph. 

To localise network overhead, collecting data by proxy involves the deployment of multiple
proxy agents. As shown by the graph (in Figure 3), agents with a degree greater than 1 act as
proxies in addition to collecting local information.

To determine which agent is called, a two-dimensional array is searched. Below is an example 
of such an array which represents the topology of Figure 3. 

$proxy[0][0] = 'A1';
$proxy[0][1] = 'A2';
$proxy[0][2] = 'A5';
$proxy[0][3] = 'end';
$proxy[1][0] = 'A2';
$proxy[1][1] = 'A3';
$proxy[1][2] = 'A4';
$proxy[1][3] = 'end';
$proxy[2][0] = 'A4';
$proxy[2][1] = 'A6';
$proxy[2][2] = 'A7';
$proxy[2][3] = 'end';

The first element of each row identifies the agent which acts as a proxy. Subsequent elements
identify the hosts on whose behalf the proxy acts. For example, host A1 is a proxy agent for
hosts A2 and A5, while A2 and A4 are proxy agents which collect data on behalf of A3, A4 and
A6, A7 respectively.
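A possible way of searching such a table and collecting data recursively is sketched below in Python (the original program was written in Perl and is not reproduced in the paper). The function names delegates and collect and the per-agent statistics are hypothetical, and the sketch assumes a tree topology; a topology containing cycles would additionally need a record of visited agents.

```python
# Same table as above: first element is the proxy, 'end' terminates the row.
PROXY = [
    ["A1", "A2", "A5", "end"],
    ["A2", "A3", "A4", "end"],
    ["A4", "A6", "A7", "end"],
]


def delegates(agent, proxy=PROXY):
    """Return the hosts on whose behalf `agent` acts (empty if it is not a proxy)."""
    for row in proxy:
        if row[0] == agent:
            return row[1:row.index("end")]
    return []


def collect(agent, local_stats, proxy=PROXY):
    """Combine the agent's own statistic with those gathered from its delegates."""
    total = local_stats[agent]
    for child in delegates(agent, proxy):
        total += collect(child, local_stats, proxy)  # each proxy returns one value
    return total


# Hypothetical local packet counts for agents A1..A7.
stats = {f"A{i}": i for i in range(1, 8)}
print(collect("A1", stats), collect("A4", stats))
```

Starting the collection at A1 covers the whole tree, while starting at A4 covers only the subtree below it; in either case each proxy passes a single combined value upwards instead of raw data.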

As with the previous examples, many advantages exist using this method. 

1. Data is localised between segments. The raw data does not need to be transferred to the 
NMS. This can be filtered out at the agents. 

2. Intelligence is shifted from the NMS. Collecting network management information can 
be delegated to sub agents (proxies) which collate the data locally and return only value 
added data. 

3. Statistics can be collected from any point of the network. The recursive nature of the 
program demonstrates that data can be collected for a subset of the topology. 

4. The code is topology independent. The example demonstrated a proxy agent for a tree 
topology. By modifying the $proxy array, the collation of network management data 
by proxy will work for any topology. 

5. Value added data is generated. 

4.3 Fault Tolerant Agent 

A problem with the current RMON standard is limited fault tolerance. The current RMON 
standard employs an alarm group where each alarm monitors a single MIB variable. If a 
threshold has been exceeded then the NMS is notified (by an snmp-trap) and subsequent action 
is taken. This emphasises the centralised approach meaning that fault tolerance is reduced. If 
the RMON agent becomes isolated then no monitoring takes place. 

The fault tolerant application works by executing a migratory application (stored in an array) 
which is transferred to the remote agent’s MIB using snmp-set when a threshold is exceeded. 

Again, this program highlights the movement away from the centralised approach as insignificant
data is not processed by the NMS. Therefore, this program achieves the goal of value
added data.

This program is more fault tolerant than the alarm group because if a segment between the
agent and the NMS is isolated then network management data is still collected. This also reinforces
the goal of offline operation, originally set out in the RMON MIB [21].

4.4 Combined Example 

Examples provided in the previous sections demonstrate how raw statistics (from the RMON
MIB) can better achieve the fundamental goals of the original RMON MIB [21]. As mentioned
previously, these objectives are offline operation, fault tolerance, value added data and
scalability.





Writing an application which combines these four features will demonstrate the flexibility of 
the programmable agent. Furthermore, distributed management will be enforced. 

The RMON MIB provides a hostTopN group which generates ordered statistics based on a 
value in the host table. The hostTopN group is limited in that a hostTopN probe entity can only 
be executed once. Subsequent iterations require the probe entity to be reinitialised. Hence, 
intervention from the NMS is required. This problem stems from the centralised approach to 
network management, forced by non-programmable RMON agents. 

The RMON MIB also provides the alarm and event groups. The combination of these groups 
allows for alarm thresholds to be compared and if a threshold has been exceeded then either 
an event is logged and/or an alarm is generated. The limitation of these groups is that alarm 
thresholds can only be compared to one MIB variable meaning that value added data cannot be 
monitored. Collecting network data based on a fault requires a probe entity to be established. 
This process is normally carried out by the NMS. 

A programmable application, whose algorithm is provided below, uses the hostTopN
group to provide value added data. When the network utilisation (itself value added data),
calculated from a number of MIB variables (the number of packets and bytes), exceeds a
threshold, a hostTopN report is generated. Each report is copied into the data table, which
stores a value added summary of the hostTopN rankings each time they are generated.

Two programs were written to demonstrate this application. The first program monitors the 
network utilisation (itself a value added variable) and triggers an event when the threshold has 
been exceeded. The second program collects and produces value added data (a summary of 
hostTopNs) which is stored in the data table for collection by the NMS. 

Program 1

    Calculate network load (based on the number of PDUs and octets)
    if load > threshold
        activate program 2

Program 2

    Initialise host table
    Activate RMON probe
    Wait 10 seconds
    Deactivate RMON probe
    Collect results
    Store in table

Note that program 2 can be expanded to trigger multiple events where the program code resides 
on multiple hosts. The proxy agent can be integrated which will collect network statistics from 
another agent. 
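The two programs can be sketched in Python as follows. Two assumptions are made that the paper does not state: the load is estimated with the usual figure for 10 Mb/s Ethernet (9.6 µs inter-frame gap plus 6.4 µs preamble per packet, 0.8 µs per octet), and the probe handling in program 2 is replaced by a hypothetical callable, run_host_top_n, which stands in for the five steps from initialising the host table to collecting results.

```python
def utilisation_percent(delta_packets, delta_octets, interval_seconds):
    """Estimate 10 Mb/s Ethernet utilisation (0-100) from counter deltas."""
    # Each packet costs 9.6 us of inter-frame gap plus 6.4 us of preamble;
    # each octet costs 0.8 us on the wire at 10 Mb/s.
    busy_microseconds = delta_packets * (9.6 + 6.4) + delta_octets * 0.8
    return busy_microseconds / (interval_seconds * 10_000)


def program_1(delta_packets, delta_octets, interval_seconds, threshold, activate):
    """Program 1: compute the load and activate program 2 above the threshold."""
    load = utilisation_percent(delta_packets, delta_octets, interval_seconds)
    if load > threshold:
        activate()
    return load


def program_2(run_host_top_n, data_table):
    """Program 2: obtain one hostTopN report and store it in the data table."""
    # run_host_top_n is a hypothetical stand-in for: initialise the host table,
    # activate the probe, wait 10 seconds, deactivate it and collect the results.
    data_table.append(run_host_top_n())


data_table = []
fake_report = ["08:00:69:07:6e:cb", "08:00:69:08:e0:a8"]  # hypothetical ranking
load = program_1(1_000, 500_000, 1, threshold=10.0,
                 activate=lambda: program_2(lambda: fake_report, data_table))
print(round(load, 1), len(data_table))
```

Both the load calculation and the decision to start a hostTopN run happen on the agent, so nothing is sent to the NMS until it chooses to read the data table.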

The table update step looks at the hostTopN list and produces a summary table (shown below)
of the busiest hosts. The table shows that on the 24 occasions the
utilisation threshold was exceeded, host 08:00:69:07:6e:cb was the top communicating
host 10 times, host 08:00:69:08:e0:a8 was the top communicating host 6 times, etc.





Alarm trigger = 0
count = 24
 1 = 00:00:0c:01:8a:db   0  3  9  6  4  2  0  0  0  0
 2 = 00:00:1d:0a:d4:53   0  0  0  0  0  0  0  0  1  0
 3 = 00:00:c0:57:cc:0b   1  0  1  0  0  0  0  0  1  1
 4 = 00:00:c0:9f:a1:8a   0  0  1  1  0  3  2  2  4  2
 5 = 00:a0:c9:00:cf:0e   0  0  0  1  2  0  0  2  0  0
 6 = 08:00:20:06:ea:2a   1  0  2  0  0  1  2  3  2  4
 7 = 08:00:20:0b:a1:18   0  0  0  0  0  0  0  0  1  0
 8 = 08:00:20:0b:f1:37   0  0  0  0  1  2  1  2  0  1
 9 = 08:00:20:0d:f4:54   0  0  0  0  1  1  0  0  0  0
10 = 08:00:20:10:0f:95   0  0  0  0  0  0  3  2  0  1
11 = 08:00:20:1a:47:bf   0  1  0  0  1  1  1  1  4  0
12 = 08:00:20:79:41:99   0  0  0  3  5  4  5  2  4  0
13 = 08:00:69:02:43:5c   0  0  1  0  2  0  0  0  0  0
14 = 08:00:69:06:3a:25   0  0  0  0  2  0  0  2  0  3
15 = 08:00:69:06:79:77   0  0  0  0  0  0  1  1  0  1
16 = 08:00:69:07:37:1a   0  0  0  0  0  0  0  1  0  0
17 = 08:00:69:07:60:5d   0  0  0  0  0  0  0  0  1  0
18 = 08:00:69:07:60:74   0  0  0  0  0  0  0  0  0  1
19 = 08:00:69:07:6e:31   0  0  0  0  0  1  0  0  1  0
20 = 08:00:69:07:6e:be   0  0  0  0  0  0  1  0  0  1
21 = 08:00:69:07:6e:cb  10  2  2  2  3  2  1  1  0  0
22 = 08:00:69:07:6e:d1   0  0  0  0  0  0  0  0  2  3
23 = 08:00:69:07:6e:e2   0  2  1  0  0  1  0  0  0  0
24 = 08:00:69:07:6e:e4   1  2  2  1  2  0  0  1  1  1
25 = 08:00:69:08:9d:22   0  0  0  0  0  1  1  0  1  1
26 = 08:00:69:08:e0:a8   6  8  1  5  0  2  0  1  0  0

The alarm trigger, set by program 1, indicates if the program needs to update the table. A 
counter variable is used to indicate the number of times the table has been updated. 


Subsequent rows list the hosts (sorted by hardware address) and the number of times each host
has appeared in each ranking of a hostTopN report. A quick interpretation shows that when the
network utilisation exceeds 10%, host 08:00:69:07:6e:cb generally transmits the most
PDUs. We can see that 08:00:69:07:6e:cb has appeared in the hostTopN table 10 times
at rank 1, 2 times at rank 2, etc.
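One way to maintain such a summary is sketched below in Python. The names update_summary and format_summary and the three example reports are hypothetical; the sketch only assumes that each hostTopN report is an ordered list of hardware addresses, with rank 1 first and at most ten entries.

```python
from collections import defaultdict

RANKS = 10  # a hostTopN report is assumed to list at most ten hosts


def update_summary(summary, report):
    """Add one hostTopN report (hosts ordered from rank 1) to the summary."""
    for position, host in enumerate(report[:RANKS]):
        summary[host][position] += 1


def format_summary(summary, count):
    """Render the summary with one row per host, sorted by hardware address."""
    lines = [f"count = {count}"]
    for number, host in enumerate(sorted(summary), start=1):
        counts = " ".join(str(value) for value in summary[host])
        lines.append(f"{number} = {host} {counts}")
    return "\n".join(lines)


summary = defaultdict(lambda: [0] * RANKS)
reports = [
    ["08:00:69:07:6e:cb", "08:00:69:08:e0:a8"],
    ["08:00:69:07:6e:cb", "00:00:0c:01:8a:db"],
    ["08:00:69:08:e0:a8", "08:00:69:07:6e:cb"],
]
for report in reports:
    update_summary(summary, report)
print(format_summary(summary, len(reports)))
```

Each row then reads in the same way as the table above: the first number after the address is the number of times the host was ranked first, the second the number of times it was ranked second, and so on.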

This program combines the goals which are demonstrated by the previous examples. Offline
operation is now available as value added data can be monitored without the need for intervention
from the NMS. Furthermore, the program can be activated from an event.

Data generated for the NMS is value added. To achieve the same result using the standard
approach, the NMS would have to continually collect hostTopN reports and calculate the list.
This example makes value added data available which does not require any further processing
by the NMS.

The level of fault tolerance has also increased. Suppose a segment becomes isolated from the 
NMS and a fault occurs. A program is activated without any intervention from the NMS. 

Shifting the intelligence from the NMS has created a more scalable solution. This reinforces
the distributed management paradigm.





5 Discussion 


We have shown (in Section 3) an alternative technique for remote network management, called
the programmable agent. Furthermore, we have also seen (in Section 4) some abstractions
for remote monitoring which use the programmable agent. The combination of these techniques
and abstractions along with the existing RMON MIB provides a better framework for
distributed management.

As mentioned previously, distributed management moves away from traditional, centralised
network management where all the intelligence is at the NMS. We have seen that the programmable
agent provides distributed management as functionality can be injected into agents.
Furthermore, the following added features are exhibited:

1. Scalability. Most centralised approaches and some distributed implementations (including
the RMON MIB) use the client-server approach [5]. In the existing RMON MIB,
collating data based on the output of two agents requires the NMS to retrieve and collate
both data sets. However, using the programmable agent, it is now possible for RMON
agents to collaborate. This increases scalability as more agents can be deployed and the
NMS does not have to continually probe them.

2. Heterogeneity. The programmable agent exhibits system independence by using an interpreted
language which hides system dependent functions as well as using a common
transfer syntax (ASN.1) and transport protocol (SNMP). This means that applications
are not dependent on RMON agents which are suited to a particular architecture. As
networks are becoming more complex and the number of system architectures is increasing,
heterogeneity is a key issue in a distributed network management system.

3. Fault Tolerance. Allowing programs to be executed locally increases fault tolerance. 
This is imperative to successful network management as the Network Manager always 
wants to know the status of the network, regardless of whether it is isolated for a certain 
period of time. By implementing a programmable agent, along with its support utilities 
(discussed in Section 5), a distributed network management framework (using the agent 
rmond) was put into place to achieve the above. 

6 Future Work 

There are many directions this project could take. One of these is security. Although
security issues have not been formally addressed in this research, these problems have been
identified and will be used as the basis for future projects.

In a commercial environment, security would be a high priority. SNMP version 1 [1], SNMP
version 2 (SNMPv2) [1] and community-based SNMP version 2 (SNMPv2c) provide adequate
mechanisms for preventing unauthorised hosts from accessing the MIB.

rmond uses SNMP version 1 as its transport protocol. All requests to the agent require a 
community string. Requests with an invalid community string will be ignored by the agent. 

It was noticed that applications that use multiple agents (i.e. the migratory and proxy applications)
reduce the level of security. If a community string is discovered on the first agent,
then subsequent agents can be accessed. The reason for this is that the program needs to be
made aware of the community string on subsequent agents. These community strings can be
found by traversing the MIB on the first agent.

Another security concern is the process owner of programs. rmond requires promiscuous
packet capture as the basis for generating RMON data, meaning that on some workstations
rmond may need to be executed by the super-user (with an effective user-id of 0). This is
because some packet filters require the effective user-id to be 0 before the network interface can
be accessed. A side effect is that the application may be executed with super-user privileges.

In practice, it was decided to execute an application with an effective user-id of -2. A similar 
solution has been used in conjunction with the Network File System (NFS) [18]. 


7 Conclusion 

Remote monitoring in the context of network management has given the Network Manager 
the ability to better maximise the efficiency of a network. Overall, this leads to better network 
performance and productivity. 

The Remote Monitoring Management Information Base (RMON MIB) [19, 21, 22] provides 
a mechanism for gathering network statistics on remote segments. Despite the fact that the 
RMON MIB is useful, problems exist which prevent the Network Manager from achieving its 
full functionality. 

We believed that the RMON MIB had much room for improvement. The addition of a programmable
component provides the Network Manager with greater flexibility. The solution
proposed demonstrated that network management is no longer constrained to the functionality
offered in existing MIBs or hindered by the problem of centralised network management.
Furthermore, this solution has demonstrated (using the four abstractions) that many of the goals
of the RMON MIB can be more fully achieved.

Network management is nowhere near perfect (however much the Network Manager would like
to believe otherwise). To save costs and improve network performance, the Network Manager
is always looking for better, faster and more reliable alternatives. This paper has presented an
innovative solution allowing Network Managers to better manage the network and not be
constrained by problems which exist in current network management architectures.


References 

[1] J. Case, K. McCloghrie, M. Rose, and S. Waldbusser, “RFC 1441: Introduction to version 
2 of the Internet-standard Network Management Framework,” May 1993. 

[2] D. D. Clark, “The design philosophy of the DARPA internet protocols,” Computer
Communication Review, vol. 18, pp. 106-114, Aug. 1988.

[3] D. E. Comer, Internetworking with TCP/IP: Volume 1: Principles, Protocols, and
Architecture. Prentice Hall, 1991.

[4] L. Deri, “Network management for the 90s,” tech. rep., IBM Zurich Research Laboratory,
1996.





[5] G. Goldszmidt, “Distributed system management via elastic servers (position paper),” in 
Proceedings of the IEEE First International Workshop on Systems Management, 1993. 

[6] R. Kooijman, J. van Oorscot, and D. Wiesse, “RMON implementation issues for
managers and agents,” draft request for comments, Delft University of Technology, Sept.
1994.

[7] A. Leinwand and K. Fang, Network Management: a practical perspective.
Addison-Wesley, 1993.

[8] K. McCloghrie and M. Rose, “RFC 1066: Management Information Base for network 
management of TCP/IP-based internets,” Aug. 1988. Updated by RFC1156 [9]. 

[9] K. McCloghrie and M. Rose, “RFC 1156: Management Information Base for network
management of TCP/IP-based internets,” May 1990. Obsoleted by RFC1158 [14].
Updates RFC 1066 [8].

[10] K. McCloghrie and M. Rose, “RFC 1213: Management Information Base for network
management of TCP/IP-based internets: MIB-II,” Mar. 1991. Obsoletes RFC1158 [14].
See also STD17 [11].

[11] K. McCloghrie and M. Rose, “STD 17: Management Information Base,” Mar. 1991. See
also RFC1213 [10].

[12] J. Mogul, “Efficient use of workstations for passive monitoring of local area networks,”
Research Report WRL 90/5, Digital Equipment Corporation Western Research
Laboratory, 1990.

[13] S. Rabie, “Integrated network management: Technologies and implementation
experience,” in INFOCOM ’92, 1992.

[14] M. Rose, “RFC 1158: Management Information Base for network management of 
TCP/IP-based internets: MIB-II,” May 1990. Obsoleted by RFC1213 [10]. Obsoletes 
RFC1156 [9]. 

[15] M. Rose and K. McCloghrie, “Structure and identification of management information 
for TCP/IP-based internets,” Request for Comments 1155, Internet Engineering Task 
Force, May 1990. 


[16] W. Stallings, SNMP, SNMPv2 and CMIP: The Practical Guide to Network Management
Standards. Addison-Wesley, 1993.

[17] W. Stallings, SNMP, SNMPv2 and RMON: Practical Network Management.
Addison-Wesley, 1996.

[18] Sun Microsystems, Inc, “RFC 1094: NFS: Network File System Protocol specification,” 
Mar. 1989. 

[19] S. Waldbusser, “RFC 1271: Remote Network Monitoring Management Information 
Base,” Nov. 1991. Obsoleted by RFC1757 [21]. Updated by RFC1513 [20]. 

[20] S. Waldbusser, “RFC 1513: Token ring extensions to the remote network monitoring 
MIB,” Sept. 1993. Updates RFC 1271 [19]. 





[21] S. Waldbusser, “RFC 1757: Remote Network Monitoring Management Information 
Base,” Feb. 1995. Obsoletes RFC1271 [19]. 

[22] S. Waldbusser, “RFC 2021: Remote network monitoring Management Information Base 
version 2 using SMIv2,” Jan. 1997. 

[23] L. Wall, “The Practical Extraction and Report Language - version 5.003.” Computer 
Software, Feb. 1996. 





Who’s sucking my data? 


George Michaelson

Manager, Technical Services, DSTC Pty Ltd
ggm@dstc.edu.au 


ABSTRACT 

This paper examines some issues in the processing of web and FTP logs for a large mirror site,
and discusses some issues exemplified by the typical software development non-process for its
log processing methodology.

• Don’t trust DNS to say where things come from. 

• Some UNIX tools could maybe do with some cat -v. 

• Ask around how to code. Others do it better! 

• Old algorithms never die. Dusty decks are worth re-reading. 


1. Data? what Data? 

The DSTC 1 facilities manages the AARNet FTP mirror, 2 a 150Gb web and FTP server mirroring a range of packages from 
overseas sources, aiming to provide a faster and cheaper source onshore to the academic community (and the wider Internet 
regionally, as a side-effect). Like many similar resources the project has to account for usage, and show cost:benefit outcomes 
favouring the funding body’s interest group: the AARNet community ’at large’. Already in under 6 months of operation the 
project has served over 800Gb of data for an import cost of around 100Gb. This has delivered at least 3:1 efficiency to the 
direct interest community, and considerably faster access to the mirrored packages. But at the same time, it has around 50,000 
users per day (and growing) and thus an increasing load in logfile processing. 

At first glance, this is a ’solved problem’ - any of a number of web and FTP related sites online can be used to find
alternatives for processing logfiles and producing accounting information. In fact the mirror project has chosen to use one of the
more common packages, Analog [ANA], precisely because it is so widely used, and provides a flexibility of output forms and
good statistical accuracy.


2. DNS. Fast, reliable: pick one... 

However, on delving into the processed logs, it is very easy to see there is a substantial problem lurking under the surface:
DNS is not a guaranteed service, and a significant number of users have non-resolvable IP addresses, or names which don’t
map back cleanly to the IP address used at the time of access. Peaks of over 25% non-resolving addresses have been seen,
which impacts on the accuracy of name-based log processing (see Figure 1). On reflection, both the increase in modem access
to the Internet and the increasing use of dynamic IP address assignment to dialup users may well have something to do with
this problem:

• The relationship between a specific address and a dialup user is ’ephemeral’.

• Tying a large pool of addresses to valid forward and reverse DNS names is time consuming, and systems
administration may be tempted to cut corners.


2.1. But AARNet is ok isn’t it? 

The distribution of these ’problem’ users across different sub-classes of the Internet community from a local perspective (eg
AARNet, onshore-private, onshore-government, offshore) is not equal (see Table 1). Although an accurate redistribution of
the numbers can be derived, it forced the project to reconsider the value of DNS names in this specific context.


1 (CRC for Distributed Systems Technology: http://www.dstc.edu.au) 

2 http://mirror.aarnet.edu.au


Who’s Sucking My Data? - a 1970’s Algorithm Rescues a 1990’s Problem



[Figure 1: plot; y-axis 0.0% to 35.0%, x-axis dates 01/04 to 01/12]

Figure 1. % of DNS requests which are unresolved, web and FTP.
from 225 days of logs (since Jan 1)

userclass   count     min      max      avg
aarn           98   0.003   47.176    5.846
au             99   2.684   55.931   12.321   (non-aarn/csiro/rfc1918)
csiro          14   0.005   69.018   10.582
rfc1918        39   0.061   97.303   16.657
all            99   2.872   25.796   11.450   (all subtypes)

Table 1. distribution of unresolved DNS across 4 subsidiary classes and all users


2.2. We know where you live... in Tonga? 

On further examination it was discovered that a small, but non-trivial number of users within the prime community were also
using so-called ’generic’ top-level domains (gTLD) and even some other ISO country code domains (ccTLD) and thus
incorrectly being counted as ’offshore’ users. This trend is only going to continue, with the emergence of the new 7 gTLD later
this/next year, and availability of ’high value’ names in the ISO country code domains such as the .TO domain which is run
as a commercial resource. (See Table 2)


from 225 days of logs (since Jan 1)

userclass   count     max     avg
aarn           50   75.94    9.33
au            471   85.08    5.13   (non-aarn/csiro/rfc1918)
csiro           3    2.46    0.93
rfc1918        18   99.92   42.07

Table 2. distribution of gTLD and non-AU ccTLD in 4 subsidiary user classes


3. So what are our users and why does it matter? 

Irrespective of these problems with DNS data, the question ’who our users are’ has two rather different meanings:

[1] Who are they, as in which ’interest group’ do they lie in?

In this model, the class of users who can be considered as being AARNet + CSIRO + ’extras’ (eg some government
agencies, CRC’s etc) is the primary target audience, and it needs to be shown the project serves 3 and reflects those who
ultimately pay for the service. ’Everybody else’ can be considered

• an extra benefit in the case of domestic usage,

• an extra social equity load in the case of regional traffic AARNet might want to sponsor (eg within the APAN
community)

• and slightly odd in the case of users from Europe, continental USA and worldwide in general, who seem
to find us an attractive source of downloads. 4

3 (with AARNet and CSIRO the primus inter pares since they are paying for the service directly) 

4 Possibly because of alphabetical order, and not otherwise a problem except it costs us money. See later... 


Being able to determine which category a user lies in makes it possible to show correct cost:benefit outcomes to the
primary target. It allows the more ’political’ issues of offering service in an Internet under a specific funding regime to
be addressed.

[2] How much does it cost us to serve these people? 

In this model, there are three distinct classes of users, a function of how much it costs us to handle data coming from 5 
their IP address. (See table 3) 


category                       cost implications
AARNet + CSIRO + ’extras’      AVCC network rates
any domestic ’other’           Optus domestic settlement rate
any international ’other’      Optus international rate

Table 3. cost implications under 3-tier differential charging for user class

In this model, there is a more direct financial interest in knowing and possibly limiting use based on where the users
come from. Clearly, aspects of both forms of question reflect a need to analyze usage by classes, however they are
formed. The specific address classes used are more directly related to the second form, but usually also model the first
form. 6

3.1. Where do the costs come from? 

Providing large and popular download sources incurs an import cost, as a function of the volume of traffic served. This import 
cost comes from three sources: 

• The applications-protocol specific ’request’ side of the ’request’ - ’response’ exchange, required to initiate the fetch of 
data. 

• The TCP and IP overhead for acknowledgment of data received. 

• Indirectly incurred traffic, such as DNS responses and syslog() activity off-host. 

Of course, the direct import cost of the packages themselves has to be taken into account, although that is excluded from this
specific analysis process at present. Since the project controls import, it can account for the cost by source network class
reasonably accurately.

The combination of protocol overhead and TCP/IP acknowledgment cost is estimated at between 10% and 20% of the served
data volume. This is a very imprecise statement, because it’s a function of several variant parts:

• The Path MTU and window size in the TCP layer which affects the ratio of sent fragments, and back
acknowledgments.

• The end-to-end loss or packet drop, which increases retransmission and incurs extra back acknowledgments. 

• The innate difference between FTP and WWW for average requested object size, and thus the different ratio of request 
size to served data in the mix of web and FTP data served. 

• The length of the web URL and FTP filepath which increases the inbound data during the request phase. 

3.2. Serving data to the wrong people can be expensive (for either party) 

For working purposes the mirror project has chosen to model the overhead as 10%. Thus 800 Gigabytes of served load
incurs an estimated 80 Gigabytes of overhead. Clearly, if there were a significant percentage of offshore users, under this level
of overhead the project could incur up to double the current 100 Gigabytes data import cost in running the service, which
reduces the relative cost savings accordingly.
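
The arithmetic above can be sketched as a small calculation. This is only an illustration: the function name is hypothetical, and it assumes (for simplicity) that all of the overhead traffic is charged at import rates, which the paper only suggests as a worst case.

```python
def import_cost_gb(served_gb, overhead_fraction, package_import_gb):
    """Return (overhead, total inbound volume) in Gigabytes.

    Assumption for illustration: every byte of request/acknowledgment
    overhead is charged as import, on top of the mirrored packages.
    """
    overhead_gb = served_gb * overhead_fraction
    return overhead_gb, overhead_gb + package_import_gb


# The paper's working figures: 800 GB served, 100 GB of package import,
# and an overhead somewhere between 10% and 20% of served volume.
for fraction in (0.10, 0.20):
    overhead, total = import_cost_gb(800, fraction, 100)
    print(f"overhead at {fraction:.0%}: {overhead:.0f} GB, "
          f"total import {total:.0f} GB, "
          f"served:import = {800 / total:.1f}:1")
```

At the 10% working figure the overhead (80 GB) is close to the 100 GB package import itself, which is why the text speaks of the import cost nearly doubling.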

Additional cost includes the bandwidth consumed, and rescaling costs if congestion ensues. All these costs have to be set
against the value of providing service which ethically is attractive, but can also be a financial incentive to an ISP in a
peer-settlement regime since the amount of data exported (and by definition imported by the peers) can reduce relative settlement
debt, or even convert it into a net profit.


5 Since the current charge model assumes inbound traffic predominates, and that is what is charged on, although if this flipped 
over to charge by bytes served, the class would remain the same. 

6 This is decreasingly true as dialup access in the tertiary sector is ’outsourced’ and unless ISP’s provision such users from
specific address spaces, it may become impossible to account for all legitimate primary target usage.






4. Planning a logfile process #1. 

Early on in the pilot project, it was realized that being able to account for the usage accurately was going to be vital. With
unresolvable IP addresses running over 10% and mis-assigned domain names adding further uncertainty, the imprecision in the
cost modeling was felt to be excessive, until benefit could be clearly shown to the core target, and excess usage and associated
costs sustained as overhead.

Looking at the growth curve of usage by served bytes, and by end-users, the project also required methods for processing
logfiles which scaled across 10 times increases in served load. 7 (see Figure 2 and Figure 3). And initially, it was assumed that
’live’ processing would be preferable.


[Figure 2: plot; y-axis 0 to 100000, x-axis dates 01/04 to 01/12]

Figure 2. number of downloads since April 1 1998.

[Figure 3: plot titled ’service uptake’; y-axis number of users, x-axis dates 01/04 to 01/12; graph drawn 14/08/98 02:22:25]

Figure 3. mirror site usage since April 1 1998 with linear trend.


4.1. Don’t throw away data in logfiles! 

Almost immediately, it was realized that in logging either DNS name, or IP address only, information was lost because:

• Some IP addresses do not map to a DNS name, and thus no inference can be made about who the user thinks they are, in
requesting service (in practice, most IP addresses map to within 1 sub-domain of a plausible user, and can be identified
to the country or top-domain, but this requires work to be reliably embedded in procedures, and is clumsy to clean up
manually)

• Some DNS names do not map reliably to an IP address, and thus it’s hard to find out where they came from, using name
information alone.

Given the low overhead of adding IP address information to logfiles 8 it was decided to modify the FTP and www daemons
accordingly. This has incurred some local software maintenance overhead for upgrading 9 which has yet to happen in the life
of the project, and is estimated at less than one day’s work per instance.

Service providers are very strongly recommended to log both name and address.
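
As a rough illustration of the recommendation, the sketch below builds a log record that always carries the address, adds the name when one resolves, and flags names that do not map back to the same address. It is not the project's daemon patch: the function name and the status labels are hypothetical, and the resolver functions are passed in as parameters so the logic can be exercised without a live DNS.

```python
import socket


def log_fields(addr, reverse=socket.gethostbyaddr,
               forward=socket.gethostbyname_ex):
    """Return (address, name, status) for one client address.

    status is 'ok' when the name maps back to the address,
    'unresolved' when no name exists, 'mismatch' otherwise.
    """
    try:
        name = reverse(addr)[0]          # reverse (PTR) lookup
    except OSError:                      # herror/gaierror are OSError subclasses
        return addr, "-", "unresolved"
    try:
        forward_addrs = forward(name)[2]  # forward lookup of that name
    except OSError:
        return addr, name, "mismatch"
    status = "ok" if addr in forward_addrs else "mismatch"
    return addr, name, status
```

Whatever the status, the address is never thrown away, so later processing can fall back on address-based classification.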


7 This represents less than 4 doublings which is achievable inside a 1-2 year timeframe, at a 6-9 months doubling cycle. This is 
typical on the Internet. 

8 15 bytes in standard ASCII dotted-quad notation 

9 Since reduced to only wu-ftpd since apache-1.3 now supports embedded raw IP address as a token in its configurable logging 
module. 






4.2. Use other people’s work!

AARNet nationally has to account for internal and external relative and absolute traffic flows. This in turn is reflected by
individual AARNet member institutions having to account for traffic flows. At U.Q., amongst other tools, the Cisco ’netflow’
[CIS] router-embedded method is used. This process requires lists of known IP address ranges to be categorized so the
netflow software can identify bytecounts by the cost center concerned.

Because the project had access to Cisco netflow-enabled systems, it was also quickly realized that the netflow configuration
files could be used to manage a list of known bindings between networks and ’user classes’. This had clear advantages.

Because somebody else was maintaining the netflow configuration 
(and it related to money, so wasn’t going to become an orphan 
project) the list of known networks and their associations was 
going to be reasonably reliable. 

However, it also introduced some immediate concerns about scalability. Anecdotally, it’s well known that around 50,000
distinct routes ’float around’ in the BGP exchanges worldwide, maintaining the connectivity of the global network. These
represent the disjoint (in many cases semi-random) allocation of networks to entities with access to the network. Prior activity
filtering packet flows against address lists at U.Q. had been done, at considerable cost in memory-resident hash lists,
astronomically large processes 10 and weekends of processing time. Admittedly, this was just ’FUD’ but it had some effect in
encouraging a search for an efficient method of processing the logfiles.


4.3. Definition by exclusion: don’t work when you don’t have to. 

An initial decision was taken to define ’offshore’ by exclusion: 

Anything not identified in an address list of more immediate concern 
was very likely to be an offshore network, nondescript. 

This reduced the scope of the required comparison against networks onshore considerably, to under a few thousand at most. 11 


4.4. Arrays and bitmaps == faster execution? 

The first design assumed that the onshore addresses could lie anywhere in the global IPV4 address space. Therefore a method
which would permit a rapid identification of an onshore address in a wide, sparse space would be ’nice’. Hashing was
considered, but rejected as complex. A B-Tree or other sorted structure was considered but rejected as incurring too much creation and
walk cost. Finally, it was decided to use the 4 byte structure in decomposed sets, the first 2 bytes to identify one of 65,536 ’B’
networks, and then index by bits into 32 bytes (256 bits) for the class ’C’ instance.

This method is unable to manage network allocations below a /24 horizon in CIDR [CID] notation, but this appears to be a 
useful minimum allocation anyway for routing 12 . 

This design had some very useful properties: 

[1] It is almost completely based on array indexing, and bit set/test in a block of 32 bytes. Thus it has constant lookup
speed irrespective of how dense the array gets, and is very amenable to being embedded directly in the main loop: in
other words, it should be fast.

[2] It’s reasonably compact: 2 megabytes of array, irrespective of how large the onshore address space gets, as long as it lies
in a class C.

[3] It’s fast to load/dump - assuming it’s pre-set in some external process. Programs can stream bytes into memory from
disk pretty quickly. This is to compare with loading a tree or hash structure of some kind, or incurring load costs live to
build a dynamic structure (which also adds code complexity)
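
A minimal sketch of the bitmap design described above, assuming one bit per /24 network (65,536 blocks of 32 bytes, 2 megabytes in total). The function names and the example networks are hypothetical; the project's own first design was never coded, so this only illustrates the indexing idea.

```python
import ipaddress

BITMAP_BYTES = 65536 * 32   # one 256-bit block per 16-bit prefix


def mark_onshore(bitmap, network):
    """Set the bit for every /24 covered by a network of /24 or shorter."""
    net = ipaddress.IPv4Network(network)
    if net.prefixlen > 24:
        raise ValueError("the bitmap cannot represent prefixes longer than /24")
    first = int(net.network_address) >> 8          # index of the first /24
    for slash24 in range(first, first + (1 << (24 - net.prefixlen))):
        bitmap[slash24 >> 3] |= 1 << (slash24 & 7)


def is_onshore(bitmap, addr):
    """Constant-time test: index by the top 24 bits of the address."""
    slash24 = int(ipaddress.IPv4Address(addr)) >> 8
    return bool(bitmap[slash24 >> 3] & (1 << (slash24 & 7)))


bitmap = bytearray(BITMAP_BYTES)
mark_onshore(bitmap, "192.0.2.0/24")    # example networks only
mark_onshore(bitmap, "10.16.0.0/12")
print(is_onshore(bitmap, "192.0.2.77"), is_onshore(bitmap, "192.0.3.1"))
```

Because the structure is a flat byte array, it can be written to disk and read back in one operation, which is the load/dump property claimed in point [3].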

4.5. Nothing new under the sun. 

After the event it was discovered that this is almost the same as a database technique known as Bloom filters, where you hash
(with collision) to a bit in a dense structure. This is being used by the NLANR Squid web and FTP cache software [SQU] to
avoid the inter-cache check (ICP) overheads, and pass around hash lists of URLs held in cooperating caches, but with a
10-20% false hitrate. Although acceptable in their context, this non-precise method would not work in the mirror project
without subsidiary collision removal, with added processing cost and complexity.

10 For that time. Moore’s law...

11 With hindsight, this also means that normal table search or short linked lists could be used for processing logs live, with
contained memory and speed issues.

12 with /19 known to be the minimum US providers will inject in some cases 





5. Ask an expert.. 

On the point of coding this method, the problem and proposed solution was discussed casually with a work associate at the
University of Queensland, a database specialist in the DSTC distributed data research unit. [COL]

He very rapidly analyzed some core behavior of the processes and their outputs, that was going to affect how they were
manipulated:

• The logfiles are on disk already. 

• The search/match key field is amenable to linear sorting,

and from this basis three crucial observations were made:

[1] The procedures are inherently amenable to batch processing. Unless the daemons were coded to log sub-classes at each
use, or processes set up to tail filter the logfiles to compute on-the-fly sum values, it was going to be a work cycle based
on rolling logfiles and post-processing them.

[2] Sorting is a well understood problem. The UNIX sort program is very efficient both in space and time.

[3] Once sorted, file merge is faster than almost any other process of selection. Repeat merges exploit the constant sort
costs. For multiple passes, the overheads drop immensely.

It was decided to test these assertions in the project conditions, since coding hadn’t started and the tools (sort, awk, etc) were
there to use. We were given a copy of Edsger Dijkstra’s ’A discipline of programming’ from 1976 [DIJ], and shown a
beautifully simple algorithm for sorted list operations, designed (in the work cited) to process tapefiles, under considerably more
restrictive CPU and memory constraints.


5.1. Is the sort command getting a bit long in the tooth? 

To our surprise, using the UNIX sort program actually proved to be a problem. At first sight, IP addresses are very amenable
to sorting:

• They are numerically based quantities. 

• They have a well understood ordering relationship. 

• In their ’true’ form, they are fixed-size quantities. 

Unfortunately not all of these properties are exploitable in the traditional ASCII representation used for human consumption,
the so-called ’dotted quad’.

The syntax of ’numeric’ that sort uses for sort -n is based on integer and float values. Since a dotted-quad has 3 periods, it
looks like a float followed by arbitrary text which in context happens to be another float, but it’s not treated as such, and is
instead sorted by dictionary order. Thus the IP addresses fail to sort correctly since in dictionary order, 100 sorts before 99 (as
we all know from /bin/ls output, where 1, 10-19, 100-199 appear before 2, 20-29, 3, 30, etc)
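
The mis-ordering is easy to reproduce outside sort itself. The following sketch uses four made-up addresses and compares plain string (dictionary) order with an order that treats each of the four fields as a number:

```python
addresses = ["99.1.1.1", "100.2.3.4", "10.0.0.20", "9.9.9.9"]

# Dictionary order: compares character by character, so "100..." < "99...".
dictionary_order = sorted(addresses)

# Numeric order: split on '.' and compare the four fields as integers.
numeric_order = sorted(
    addresses, key=lambda a: tuple(int(part) for part in a.split("."))
)

print(dictionary_order)  # ['10.0.0.20', '100.2.3.4', '9.9.9.9', '99.1.1.1']
print(numeric_order)     # ['9.9.9.9', '10.0.0.20', '99.1.1.1', '100.2.3.4']
```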

By defining the sort field separator to be the stop character it was possible to achieve a partially correct order, which was
completely correct for records where ’.’ was the canonical field separator and the dotted quad could be specified as a
sequence of key-fields each of which would be treated numerically, or where the IP address was guaranteed to be the first
field in the file (ie sort -t. -n +1 -4). However, this was not the case for at least one of the logfiles, and is likely not to be
the case in many other instances. The effect is to cause some amount of dictionary mis-sorts for the first or last field of the
dotted-quad, with correct sorting order otherwise, as a result of the prefix or suffix text included in the first and last field.
Since sort order was fundamental to the application of the sort/merge process, the project had no choice but to find a solution
if this method was to progress.


5.2. Coding around sort 

Finally, it was decided to write a symmetric pair of filters to take addresses in dotted quad notation to a hex string, and back
again.

• The hex representation was valid for use in a sort -n pass

• It was also a fixed-length field,

• It had a compact representation of only 8 bytes,

• Hex is semi-understandable in context, where address/masks are often specified in hex (eg TFTP, ifconfig and netstat
contexts)





5.3. Filters are your friends. 

The decision to write this process as a stdin/stdout filter pair was simple and obvious. If we learn three things from UNIX 
about filters, they are surely 

[1] filters are usually useful somewhere else. 

[2] filters are usually small. 

[3] filters are usually fast. 

The resulting inet2hex and hex2inet programs are 25-line C wrappers around the standard Internet library calls
inet_addr() and inet_ntoa(). They do not implement sort -t or awk -F commandline functionality to specify
record separation nor how to find the IP address, although this can be added very easily.

The hex2inet program calls strtoul() as well as inet_ntoa() and appears to incur approximately double the time
load accordingly (See Table 4).
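
The conversion pair can be sketched in a few lines. This is a Python analogue for illustration, not the project's C source: it reuses the names inet2hex and hex2inet but relies on Python's socket.inet_aton() and socket.inet_ntoa(), and it omits any field-selection logic.

```python
import socket


def inet2hex(dotted):
    """'10.0.0.1' -> '0a000001': fixed-width, 8 lowercase hex digits."""
    return socket.inet_aton(dotted).hex()


def hex2inet(hexed):
    """'0a000001' -> '10.0.0.1': the inverse of inet2hex."""
    return socket.inet_ntoa(bytes.fromhex(hexed))


addresses = ["99.1.1.1", "100.2.3.4", "10.0.0.20", "9.9.9.9"]
# Fixed-width hex strings compare correctly as plain strings,
# so an ordinary sort on the converted key gives address order.
in_address_order = [hex2inet(h) for h in sorted(inet2hex(a) for a in addresses)]
print(in_address_order)
```

The sketch relies on the property that fixed-width hex strings of the same case sort into numeric order under a plain byte-by-byte comparison.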


6. Dijkstra’s algorithm for merge/sort processing.

Next a ’merge process’ function modeled on the Dijkstra algorithm had to be written. In essence, this assumes that a list of 
values selects the records by key, under a numeric-order test: 

if the record is ’less’ than the key {
    skip input until the record is greater or equal to the key
}
while the record equals the key {
    process as ’match’
}
if the record is ’greater’ than the key {
    skip keys until the key is greater or equal to the record in hand.
}

In this context, ’equal’ meant a test under the network mask, where each key was a mask/length pair in CIDR notation [CID].
A 125-line program sufficed to apply this rule, with an initial key table of at most 8192 masks and lengths applied discretely
for each of 13 ’rules’ comprising the 3 major classes of user, and 10 sub-classes for the core target market. Admittedly this
would require repeat passes, but this process was meant to be efficient once the sorting was done, and certainly appeared so.
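
The merge rule can be sketched as follows. This is an illustrative reading of the rule, not the project's 125-line program: it borrows the name addrselect from Table 4, assumes the keys are non-overlapping CIDR prefixes, and implements 'equal under the mask' as a range test between the first and last address of each prefix. Both inputs are read once, front to back.

```python
import ipaddress


def parse_prefixes(cidrs):
    """Turn CIDR strings into sorted (first, last) integer address ranges."""
    nets = (ipaddress.IPv4Network(c) for c in cidrs)
    return sorted((int(n.network_address), int(n.broadcast_address))
                  for n in nets)


def addrselect(sorted_addrs, ranges):
    """Yield the addresses (ascending integers) covered by any range."""
    keys = iter(ranges)
    key = next(keys, None)
    for addr in sorted_addrs:
        # Record is 'greater' than the key: skip keys until one could match.
        while key is not None and key[1] < addr:
            key = next(keys, None)
        if key is None:
            break                      # keys exhausted, nothing more can match
        if key[0] <= addr:             # record 'equals' the key under the mask
            yield addr
        # Otherwise the record is 'less' than the key: skip the record.


def ip(text):
    return int(ipaddress.IPv4Address(text))


ranges = parse_prefixes(["192.168.1.0/24", "10.0.0.0/8"])
records = sorted(ip(a) for a in ["11.0.0.0", "10.0.0.1", "192.168.1.7",
                                 "9.255.255.255", "192.168.2.0"])
print([str(ipaddress.IPv4Address(a)) for a in addrselect(records, ranges)])
```

Each record and each key is examined at most a constant number of times, so one pass costs time proportional to the combined length of the two sorted lists.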


6.1. sort/merge really IS fast! 

Interestingly the time for this filter to run is completely overshadowed by the other activities. For a 130,000 line input file, the
address select against a list of 3896 address/mask pairs completes in the same time as a pass over 1/3 as many, in around 0.1
seconds elapsed time.

The implementation time for the hex and merge filters was less than one morning’s work. Subsequently more time has been
spent refining log processing but no substantial change to these filters has been necessary.

This also revealed another advantage over the proposed Bloom filter. The original bitmap/table method was inherently a
simple binary check along the lines of: ’if in the bitmap, it belongs’ and would require either subsidiary methods to identify
sub-classes, or another 13 instances of 2 megabyte arrays. The sort based method on the other hand could be done as multiple
passes over the same data, where the initial sort operation was common and the merge select function was fast; this is a low
cost in both disk and time.


6.2. Is batch processing fast enough? 

The process has since been expanded to encompass both HTML and CSV (excel) output, for all 13 target classes of user,
each class requiring a derivation of all usage as well as FTP and web usage counted separately, and a summary of the most
popular package downloads. This is a total of 156 passes over the data, which completes in under 30 minutes every day, for
current levels of usage by 50,000 users and a total filter list of 4500 IP addresses in 13 sets.

The project management is comfortable that the majority of the processing time is spent in sort and analog, performing the
statistics themselves, and not in filtering data to derive their input. (See Table 4). Sort is performed once for the entire session and
its costs amortized over 144 of the passes over the data. The single most time-consuming activity is currently grep-ing a day’s
events from an unrolled log of 85 Gigabytes, and performing some logfile format conversion prior to processing.





for an input file of 43786 records:

phase                                               real    user   system
inet2hex                                            2.58    2.11     0.39
sort                                               57.50   45.39     4.88
hex2inet                                            6.20    3.79     0.69
addrselect                                          0.11    0.06     0.05
inet2hex | sort | addrselect | hex2inet            55.99   51.88     7.28
analog                                              9.05    8.23     0.61
inet2hex | sort | addrselect | hex2inet | analog   38.95   42.06     4.93

Table 4. average times for command execution and pipes (output > /dev/null)

7. Conclusions 

The main outcomes are that the AARNet mirror project can process logfiles based on a good approximation of where people
come from, in a timely manner, and in ways expected to scale with increased usage, certainly for the remainder of the pilot
project, and probably well into the ongoing life of the service.


Some other more general observations... 

7.1. On an Internet, sort maybe does need some ’cat -v’ additions. 

The sort function is a general purpose tool, which can’t process IP address information directly, and in an Internet, that’s a
mild problem. The problem can be fixed with simple filters on the data, but that might not always be possible (and implies
twice the headroom on-disk to hold intermediate forms in some cases).

If a simple method exists to convert sort to know about addresses, it should be considered. Subject to conversion to a numeric
form (which is one system library procedure call and error check overhead) there is a good sort/order specification as
suggested above in this paper.

Address/mask manipulations are also fundamental for an important subset of Internet systems administration. It might be nice
to see some pipe/filter programs developed to reflect this:

• egrep-like forms of specifying ’equals prefix under mask’ 

• sed-like forms of ’modify stream under prefix and mask’ 
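
To make the first of these wishes concrete, here is a minimal sketch of an egrep-like ’equals prefix under mask’ filter. No such tool is described in the paper; the name ipgrep, the regular expression and the sample lines are all hypothetical.

```python
import ipaddress
import re

# Candidate dotted quads; validity is checked by the ipaddress module.
DOTTED_QUAD = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def ipgrep(lines, prefix):
    """Yield each line containing an address inside the CIDR prefix."""
    net = ipaddress.IPv4Network(prefix)
    for line in lines:
        for token in DOTTED_QUAD.findall(line):
            try:
                if ipaddress.IPv4Address(token) in net:
                    yield line
                    break              # one match per line is enough
            except ipaddress.AddressValueError:
                continue               # e.g. 999.1.1.1 is not an address


sample = ["10.1.2.3 GET /a", "host x 192.0.2.9 ok",
          "11.0.0.1 GET /b", "bad 999.1.1.1"]
print(list(ipgrep(sample, "10.0.0.0/8")))
```

Unlike the sort/merge pass, this tests every line against a single prefix, so it suits ad hoc inspection rather than bulk classification against thousands of prefixes.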

7.2. DNS is not always there. Don’t depend on it for security (or logging) 

DNS, although a mission-critical service, is not always able to deliver an answer. You can either stall to ask for a repeat, or log
partial information, or as the project does, log both name and address and accept some fuzz in both.

Internet addresses are fundamental to working out where people come from since DNS has never, and (especially now we
face expansion of the gTLD space) cannot provide a reliable clue to geographic, cost or network location from name
information alone.

Service providers should consider logging both address and DNS name if they can afford the overhead, and need to analyze
the userbase for ’location’ in different ways.


7.3. BGP and related resources can help you find your user patterns. 

If you run a service where knowing who uses your facilities matters either for policy or cost reasons, it is worth considering
liaison with the IP provider, or the nearest source of BGP information, to get lists of addresses/masks and intuit their
relationship to where data comes from and goes to. With a little effort, a very accurate idea of where people come from either by
domain, or by source network can be found, which is likely to reflect cost issues in serving that data.


7.4. Pipes win! small simple processes are way cool (but we knew that) 

Pipes, and unix shell-level methods still provide viable functional methods to process data. The proposed bloom filter method
might work, and may yet be applied by modifications to the web and FTP daemons, to tag data by a list of assertions about
the data. While batch processing remains viable, it appears to add little overhead in coding and processing terms.


7.5. Don’t hack from scratch: ask around and dig for old methods. 

Most importantly, we have re-learned an old lesson: don’t hack from scratch, ask around and you will find somebody who
understands your problem a lot better than you do. It’s refreshing to find that a 1976 algorithm for batch processing tapes works
well to solve a 1990’s problem, and that small, fast programs can still be written in a GUI-obsessed world.


8. References 

[ANA] http://www.statslab.cam.ac.uk/~sret1/analog/ Stephen Turner, University of Cambridge Statistical Laboratory.
E-mail: sret1@cam.ac.uk

[CID] http://info.internet.isi.edu:80/in-notes/rfc/files/rfc1518.txt CIDR Address Allocation Architecture.
Y. Rekhter, T.J. Watson Research Center, IBM Corp.; T. Li, cisco Systems. September 1993.

[CIS] http://www.cisco.com/warp/public/732/netflow/index.html Cisco Netflow Services and Applications Guide.

[COL] http://www.it.uq.edu.au/personal/colomb/ Dr Robert Colomb, Senior Lecturer, School of Information
Technology, the University of Queensland. Private communication.

[DIJ] A Discipline of Programming. Edsger W. Dijkstra. Prentice-Hall, 1976.

[SQU] http://squid.nlanr.net/squid/FAQ/FAQ-16.html Duane Wessels, National Laboratory for Applied Network
Research. E-mail: wessels@ircache.net


9. Acknowledgement 

Thanks are due to Bob Colomb, Senior Lecturer, School of Information Technology, the University of Queensland, who suggested
an information systems approach to the problem, and provided the Dijkstra text and a sounding board for the methodology adopted.





Linux Clusters 
for 

Parallel Computing: 
An Overview 

(Beowulf cluster using 

Extreme Linux from Red Hat Software Inc.) 

Robert Hart 

Director, Support Services 
Red Hat Software Inc. 

August 1998 


Abstract: With PC hardware reaching commodity level prices, the research community
is moving to replace existing high speed computers with clusters (also known
as “compute farms”) of PCs running Linux. Clusters of over 1000 nodes costing
about $1 million are planned for 1999, replacing a prospective $20 million high speed
computer purchase.

This paper gives an overview of the technology and its applications within and beyond the
research environment. It also provides pointers to commercial hardware and software
that is available to provide off-the-shelf solutions.


Using clustered Linux PCs for parallel processing 





Introduction 

In the fields of applied mathematics, science and engineering there are many types of 
problem that cannot be solved except through the use of large and complex computer 
programs. These programs are computationally intensive and in many cases allow a 
solution method that (at least in part) can be performed in parallel, with two or more 
separate calculations occurring simultaneously. 

The range of problems that have at least a reasonable amount of potential parallelism
is very large: from the biosciences through all fields of engineering into theoretical
physics.

Large, specially designed “super” computers have frequently been used to take
advantage of the parallelism in calculation to speed the complete process. These
computers have the disadvantage of being (essentially) monolithic (that is “one size suits
all”) and they are expensive.

In the late 1980’s, networking technology allowed individuals and organisations to
link together high performance Unix workstations into “compute farms” or
“clusters”, taking advantage of the computational parallelism of a particular problem to
perform different parts of the calculation simultaneously on several computers.

This approach was quite successful, but suffered from three limitations:-

• Network technology was fairly slow and there was a consequent communication
bottleneck;

• Individual high performance workstations were expensive;

• The tools to handle the cluster (distribution of jobs, inter-process communication
between nodes etc.) did not exist and had to be developed.

Despite this, message passing software was developed and this class of cluster was
extensively used, often as a development tool on which code was tested prior to
transferring it to the monolithic super computer.

In the mid-1990’s, several critical advances occurred:- 

• A tenfold increase in network speed with 100BaseT;

• The availability of full duplex Ethernet at 100BaseT speeds;

• The availability of switched network circuits, including full crossbar switches
for proprietary network technology such as Myrinet (from Myricom);

• The computing power of desktop PCs approached that of low end
workstations, but at significantly lower cost;

• The availability of a powerful, stable Unix-like operating system for X86 class
PCs with easy access to source code (Linux).

This led researchers in several locations to believe that useful parallel computing
power for computationally intensive processing could be built using commodity
hardware and free software. The current Beowulf clustering technology (available from
Red Hat Software Inc as Extreme Linux) is a direct descendant of the work that
occurred in the mid 1990s.

As Linux is a Unix-like operating system, porting the existing cluster software and
libraries was seen to be a (relatively) easy job.





The Importance of source code availability 

Whilst the hardware advances in the mid 1990s were essential enabling technologies, 
the critical item that was missing was the operating system to operate on each cluster 
node. 

Most PC operating systems at that time did not adequately meet the needs for
stability, true multi-tasking and strong network support. In particular, a version of Unix was
strongly preferred as this would greatly ease the porting of existing software.

As the clustering community was very small and its needs so specialised, the
attention it would receive from the vendors of the existing proprietary X86 Unix variants would be
small. Bug fixes and performance enhancements would therefore take considerable
amounts of time as these had to come from the owner of the source code.

The availability of Linux as a usable Unix-like operating system with freely available
source code, in about mid 1993, solved the dilemma:-

• Whilst at that time Linux had limited hardware support for high speed network
technology, the necessary drivers could be (and were) written, as the source code
for the operating system kernel was freely available.

• No software is free of bugs and operating systems are no exception. As the Linux
code was available (and was actively being developed across the Internet),
bugs that were encountered could be located, fixed and rolled into the
operating system at need.

• Performance enhancements to the operating system could similarly be effected by
the users.

• All users benefited from the work of everyone else, immediately.

• Linux could be downloaded from the Internet for free.

• Linux was designed for the X86 hardware (although it now runs on a large range
of hardware).

As the researchers using clustering technology were already used to exchanging and 
sharing their software (using the Internet), the Linux development model meshed well 
with their existing operations. 


Clusters versus Super computers 

Extreme Linux clusters are usually compared to traditional super computers.
Traditional super computers are “super” in many (if not all) of the following ways:-

• Floating point operations per second (FLOPS)

• Memory bandwidth

• Memory size

• Disk access speed

• Data set size

• Price

Extreme Linux clusters cannot yet compete on FLOPS (although the 1000+ node
clusters planned for 1999 will be competitive) or memory bandwidth (where Extreme
Linux clusters have about 1/64th the memory bandwidth of a Cray CPU).

In terms of performance per unit cost, Extreme Linux clusters lead traditional super
computers by at least one order of magnitude. For example, the 1,000 node cluster
planned at Fermi Lab will cost about US$1 million whereas a super computer of
sufficient capacity for the task would cost about US$20-30 million.





(An interesting point about Linux clusters is that the investment in computing power
can be staged: a cluster can be started small and expanded in a way that is not
possible with conventional super computers.)

In terms of FLOPS, fewer CPUs is always better as it is nearly always the case that 
communications overhead is the limiting factor. Whilst Linux already supports SMP 
(Symmetric Multi-Processing) for Intel CPUs, it does not scale too well in the 2.0.x 
kernel series (it is coarse grained). The 2.2 kernel series under development (as kernel 
2.1.x) is already scaling nearly linearly to 8 CPUs and this will be of significance to 
Extreme Linux clusters from late 1998 onwards. 
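
The effect of communications overhead can be illustrated with a deliberately simple model, sketched here in Python. It assumes that run time is a fixed serial part, plus a parallel part divided evenly across the nodes, plus a communication cost that grows linearly with the number of nodes. That linear cost, like the figures in the usage note, is an assumption made for this example and is not a measurement from any real cluster.

```python
def run_time(nodes, serial, parallel, comm_per_node):
    """Estimated wall-clock time on `nodes` nodes under the toy model."""
    if nodes < 1:
        raise ValueError("need at least one node")
    return serial + parallel / nodes + comm_per_node * (nodes - 1)


def best_node_count(serial, parallel, comm_per_node, max_nodes):
    """Node count in 1..max_nodes with the lowest estimated run time."""
    return min(range(1, max_nodes + 1),
               key=lambda n: run_time(n, serial, parallel, comm_per_node))
```

With 100 units of perfectly parallel work and a communication cost of 1 unit per extra node, best_node_count(0.0, 100.0, 1.0, 64) returns 10: beyond that point each added node costs more in communication than it saves in computation, which is why a few fast nodes can beat many slow ones.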

Using Extreme Linux clusters 

As this technology is new and has only recently emerged from NASA, the high
energy physics labs etc., it is not yet an ‘end user’ technology: some assembly is
(usually) required!

Essentially, a process running on an Extreme Linux cluster runs on a single master
node. This node then distributes individual sub-processes to the sub nodes in the
cluster. As these sub-processes complete, their results are returned to
the master node. All communication is handled by an appropriate messaging library
(and special patches to the kernel of the master and each sub node).

An important point to note is that communication between sub-nodes is also allowed. 
Thus sub-nodes can cooperate without requiring the intervention of the master node 
(so a sub-node can be told to wait for data from another sub-node or to pass data to 
another sub-node). 
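
The master/sub-node pattern can be sketched on a single machine, using threads in place of nodes and queues in place of the messaging library. This is only an analogy written for this overview: a real cluster would use a message passing library across the network, and the function names and the sum-of-squares workload are invented for the example.

```python
import queue
import threading


def sub_node(node_id, tasks, results):
    """Sub-node loop: receive a chunk, compute a partial result, send it back."""
    while True:
        chunk = tasks.get()
        if chunk is None:  # sentinel from the master: no more work
            break
        results.put((node_id, sum(x * x for x in chunk)))


def master(data, node_count=4, chunk_size=1000):
    """Split `data` into chunks, farm them out and combine the partial sums."""
    tasks, results = queue.Queue(), queue.Queue()
    nodes = [threading.Thread(target=sub_node, args=(i, tasks, results))
             for i in range(node_count)]
    for node in nodes:
        node.start()
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    for chunk in chunks:
        tasks.put(chunk)
    for _ in nodes:
        tasks.put(None)  # one sentinel per sub-node
    total = sum(results.get()[1] for _ in chunks)  # one reply per chunk
    for node in nodes:
        node.join()
    return total
```

Calling master(list(range(10)), node_count=3, chunk_size=4) hands three chunks to three workers and returns the sum of the squares of 0 to 9.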

Once a cluster has been set up, in order to use it, code must be rewritten for parallel 
operation. On Linux clusters, the parallelism is provided by an appropriate library set 
(the emergent standard is MPI). 

Whilst there are software packages available to assist in parallelising code, the
optimisation offered is not the maximum possible. This can really only be achieved by hand
coding the optimisation. However, for many cases, the automatic parallelisation
process is sufficient of itself (e.g. for testing a model’s viability) or a useful tool as a first
step to optimisation by hand.

There are also tools available to examine code and provide diagnostic assistance in 
parallelising code as well as tools to monitor running processes on a cluster, which aid 
in visualising the distributed process space. 

Full parallelisation (which is really only of benefit if a process is to be run many
times) usually invokes the parallel platform paradox:

The average time to develop a moderate sized application on the latest parallel
supercomputer is equivalent to the half-life of the latest parallel supercomputer.

With Extreme Linux clusters, this is overcome to a large extent as the platform is a
standard Unix system. The computing power can be increased in a variety of ways:-

• increase the number of nodes;

• increase the compute power of each node (more RAM, faster CPU etc.);

• increase the inter-node bandwidth.





Classes of problems and cluster design 

Computationally intensive problems range from those that are entirely sequential (i.e. 
no possibility of parallel computation) through to “embarrassingly” parallel. 

It is however insufficient to look only at how easy it is to split a computation into
sub-processes: a significant factor in total computational time is the inter-node
communication requirement of the processes.

At the extreme are so called “embarrassingly parallel” problems where the
inter-node communication is minimal.

• Rendering frames for movie special effects (a cluster of DEC Alpha computers
running Red Hat Linux was used for the Titanic special effects by Digital
Domain) requires no inter sub-node communication. Once the main node has
passed off the data required for the frame to be processed by an individual sub
node, that sub node requires no further communication bandwidth until it completes
rendering and wishes to pass back the completed frame. Sizing the cluster for this is a
simple trade off between total time required and money available, with the
relationship being virtually linear.

• A cluster handling super-collider particle collision data (where each node handles
an individual event) has a similarly low communication requirement. Here
however, the cluster must be designed with sufficient nodes to handle the event
load placed on it in both computational and data movement requirements.
Fermi Lab is planning several clusters over 1000 nodes in size for 1999.
CERN is also investigating similar proposals.
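
The near-linear trade off for the rendering case can be made concrete with a few lines of arithmetic, sketched here in Python. It assumes that every frame takes the same time to render and that handing frames out costs nothing; both are simplifications, and the frame counts and times in the usage note are invented.

```python
import math


def render_hours(frames, hours_per_frame, nodes):
    """Wall-clock hours when the frames are shared evenly between the nodes."""
    return math.ceil(frames / nodes) * hours_per_frame


def nodes_needed(frames, hours_per_frame, deadline_hours):
    """Smallest node count that finishes all frames within the deadline."""
    frames_per_node = int(deadline_hours // hours_per_frame)
    if frames_per_node < 1:
        raise ValueError("deadline is shorter than the time for one frame")
    return math.ceil(frames / frames_per_node)
```

For a hypothetical job of 1000 frames at 2 hours each, 10 nodes need 200 hours and 100 nodes need 20 hours, so ten times the hardware buys a tenfold reduction in elapsed time.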

Generally however, problems have a much greater requirement than this for
inter-node communication, and inter-node communication limits are all too frequently
the bottleneck in cluster performance.

Experience has shown that switched, high speed (100BaseT, full duplex) or faster
networks are required. Full crossbar switched circuits (such as provided by Myricom)
and Gigabit Ethernet have already proven their benefit.

Inter-node communications speeds are now approaching the level where the internal 
operating system handling of data to and from the physical layer is significant; hence 
the interest in software architectures such as VIA and U-Net which aim to eliminate 
the need to copy data in memory as part of the transfer to the wire. 

Commercial products for Extreme Linux clusters 

Even though Extreme Linux cluster technology is relatively new, it is already
attracting commercial products. In the USA, Alta Technology, Paralogic and DCG are
offering Intel and Alpha CPU based pre-built clusters.

On the software front, parallelising compilers are available from The Portland Group,
and Kuck and Associates. The Numerical Algorithms Group also have a parallel version
of their libraries for Extreme Linux clusters.

On the application front, Fluent is producing a parallel version of its computational 
fluid dynamics (CFD) package and there are rumours of Finite Element Analysis 
(FEA) packages in the works for Extreme Linux clusters. 





The road ahead 

Looking at 1999 and beyond, it would appear that Extreme Linux clusters will
continue to increase in performance. Some of the things that are needed (and should
happen) are:-

• Significantly improved Intel SMP scalability with the new 2.1.x (to be released as
2.2.x) kernels (probably due out in late 1998) together with appropriate
motherboard availability;

• Increased capability on Alpha CPUs: SMP, maths libraries, commercial software
(Intel’s release of its high performance maths libraries to the Extreme Linux
group was a major assistance in improving performance);

• Faster inter-node communication through a combination of faster networks and
the adoption of new message passing models (such as VIA, U-Net, ST etc.);

• Increased capacity CPUs (higher speed Pentium class CPUs);

• New CPUs (Merced is now due in early 2000 and discussions on a Linux port are
already underway with Intel);

• The possible use of cheaper CPU technology (such as the StrongARM). Corel
Computer are introducing a StrongARM based system using Red Hat Linux.
Whilst this does not have an FPU, it is of interest due to low cost, low power
requirements and ease of rack mounting (research groups in the US are
contemplating building clusters using this technology);

• Improved Intel BIOS to assist in managing large clusters (e.g. a serial BIOS which
would allow nodes to be configured without attaching a keyboard or inserting
a video card, essentially allowing out-of-band (OOB) configuration and management
through serial multiplexors);

• Parallel debuggers: as yet these do not exist but work on this complex field is
underway;

• An increasing range of off-the-shelf hardware and software.

The 315th fastest computer in the world consists of 68 Alpha CPU PCs running
Red Hat Linux at Los Alamos National Laboratory. Since it commenced service in
May 1998, it has not required rebooting.





Links 

http://www.extremelinux.org The home site of the Extreme Linux Project. The slides
for the workshop presentations made at Linux Expo in late May 1998 are online and
provide much interesting and useful information.

http://cesdisl.gsfc.nasa.gov/linux/beowulf/ The Beowulf project home page. This has
many links to related material, including links to the latest Gigabit Ethernet drivers
(http://cesdis.gsfc.nasa.gov/linux/drivers/yellowfin.html). It also has links to Extreme
Linux clusters operating around the world; interestingly, none are listed for the UK.

http://www.fluent.com/ Fluent produce commercial software for CFD which 
NASA’s Goddard Space Flight centre has paid to have rewritten to use MPI (contact 
is Clark Mobarry mobarry@maxwell.gsfc.nasa.gov). 

http://www.myri.com/ Myricom produce high performance switches (including full
crossbar 16 port switches for 100BaseT) well suited for parallel computing networks.

http://www.plogic.com Paralogic Inc. provide custom hardware solutions for parallel
computing. They also provide software to assist in designing code to run on
networked parallel computer systems, such as their BERT 77 package for analysing and
parallelising FORTRAN 77 code.

http://www.altatech.com Alta Technology provide pre-built and installed Extreme 
Linux clusters based on Alpha or Intel hardware. 

http://bullard.esc.cam.ac.uk/~walker/ a CFD PhD student using Beowulf clusters to 
handle his research computations. 

http://www.nag.co.uk The Numerical Algorithms Group provides parallel capable
libraries to work on Beowulf clusters.

http://www.viarch.org/ This is the home site of the Virtual Interface Architecture. It
is rumoured that work on VIA for Linux is underway, but at this point no reference is
available.

http://www2.cs.cornell.edu/U-Net/Default.html U-Net is an alternative low latency,
high bandwidth networking architecture (already available for Linux).

http://www.corelcomputer.com/ Corel Computer are introducing a StrongARM
based computer using Red Hat Linux.

http://www.techweb.com/wire/story/TWB19980706S0015 The 315th fastest
parallel computer is 68 Alpha CPU machines running Red Hat Extreme Linux and it cost
only $150,000 to build!





Service Failover via Dynamic DNS Updates 


Peter Gray 

Information Technology Services 
University of Wollongong 
pdg@uow.edu.au 

July, 1998 


Introduction 


With increasing dependency on computer services such as electronic mail and the world wide 
web, users expect continuous machine and service availability. This puts increased pressure 
on IT providers to meet users’ expectations. This paper presents a relatively simple and 
cheap method of allowing failover of some useful user services to alternative machines in the 
case of computer or network failure. 


Approaches to High Availability (HA) 

Fault Tolerance 

Fault tolerant computers have been the mainstay of high availability systems for some time. 
In this class of machine, most, if not all hardware components are replicated one or more 
times and the machine continues to operate with almost no noticeable degradation in the 
case of a single or even multiple component failures. 

The downsides of this type of solution are the very high initial cost of the hardware and
the fact that the users are not protected against software failure, either of the operating
system or the applications. In addition, administration is difficult since the machine cannot
be taken down for administrative functions.


High Availability Clusters 

Some of the problems of fault tolerant machines are solved by high availability (HA) clustering. 
In these configurations, 2 or more machines share access to disks. In the case of hardware or 
software failure, it is possible for another machine in the cluster to take over the supply of 
services usually performed by the failed machine. This configuration has another advantage 
in that some form of load balancing may also be possible. 


Service Failover by Dynamic DNS Updates 





High purchase price is still a problem with clustering and, in addition, any TCP connection
will normally be broken when a machine fails, resulting in the users having to reconnect.
Some applications may do this automatically without the user being aware of it, but in the more
common case users will have to restart their application. Administration is still difficult.

Another approach is to share disks via the network rather than directly connecting them 
to the clustered machines. In this case a so-called “network appliance” can be used as a data 
repository and all access to data is via NFS (or some other remote file access protocol). This 
allows normal machines to be used as application servers. However, clustering support at 
the application server side is required. Many database systems offer support for clustering in 
some form. 


Failover via Name Services 

In some cases, important services are relatively easy to replicate. Some examples are:

WWW server: In the case where a web server is simply serving pages and running simple 
CGI scripts, simply replicating the document storage directory tree to a remote machine 
and running an equivalent WWW server on that machine is enough to provide a backup 
service. 

Mail delivery: Sendmail (or equivalent) running on multiple machines with access to the 
same password and alias information (via NIS+ or replicating the appropriate files) 
offers multiple servers which users may utilise for sending mail. 

Mail reading: This is possible if the area where mailboxes are stored is available across 
multiple machines. NFS (or equivalent) provides enough capability for this purpose. Of 
course, if the mailboxes are not available because of machine or network failure, backup 
servers may not function. 

WWW proxy: This case is particularly simple since WWW proxy servers should be 
transparent to the user in any case. Simply running additional proxy servers should be 
enough to provide the possibility of users switching servers in the event of failure. 

In cases similar to those above it is possible to “cheat” and cause applications to move 
to alternate servers in the case of failure by “tinkering” with the answers supplied by name 
services. 

For example, suppose the local proxy web server is referenced as proxy-name, which is a CNAME
pointing to machine-A. If machine-A fails, name services could be modified to start
returning machine-B as the target of the CNAME. The procedure to monitor machine state
and update name services can be automated.


History 

The idea of name servers which return varying answers to the same question is not new. A
program to perform this sort of underhandedness in the case of DNS already exists in the
form of groupd from UCLA 1. groupd implements a very limited subset of the DNS protocol.
By delegating authority for a domain to a groupd server, DNS lookups of a name return
different IP addresses depending on machine availability. The groupd server monitors a list
of hosts and updates its internal databases accordingly. Users simply connect to servers, but
which server they are actually talking to may vary depending on which machines are up or
busy, since groupd also allows for load balancing.

groupd is quite useful but it has drawbacks as well. Because it looks to the outside
world like a normal DNS server, it is impossible to run a DNS server and groupd on the
same machine. Also, it does not support CNAMEs, preferring instead to return differing A
records. It does not support PTR lookups for obvious reasons.


Janadu 

Originally janadu (JAva NAmeD Updater) was a program which monitored a list of hosts
and where necessary updated DNS via the DNS Dynamic Update Protocol (RFC 2136). The
package has grown to include 3 additional utilities. The first is a simple utility to send a
query to a DNS server and print the reply in a similar fashion to the common dig utility.
The second is a similar utility to send a single update request to a server and the third is a
program to provide a GUI for both sending DNS queries and updates.

janadu is completely written in the Java programming language and should be 100%
portable across platforms supporting a standard Java Virtual Machine (JVM).


Why Java? 

I had only superficial experience with Java before this project but had already been impressed
by the language’s features. The janadu project gave me an opportunity to write a serious
application which would be guaranteed to be portable. In addition, the language’s advanced
features such as threads and networking classes would make code development fast.


Host Monitoring 

For a configured set of DNS resource records (RRs) janadu monitors a set of host/service
pairs. In the simplest case, monitoring can be performed using the echo protocol, which will
simply determine if a host is up and/or network reachable. Because of Java’s object oriented
design, adding other monitoring protocols is easy. The other monitoring method implemented
in the current version is an “expect” 2 style method. A telnet style connection to the host
is instigated, 1 or more lines are sent to the target host and the responses are matched
against user supplied regular expressions (challenge/response pairs). If the connection fails,
or the target host does not send valid replies to the challenge strings sent from the janadu
process, the host/service is considered non-operational. The “expect” style monitoring allows
janadu to monitor specific services rather than the machine as a whole. Thus if the WWW
server fails but the machine remains operational, only the DNS RRs relating to the WWW
service need to be updated.

1 origin not entirely certain to the author

2 based on the popular Tcl based utility expect
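
The “expect” style check can be sketched in a few lines of Python. This is not janadu’s code (janadu is written in Java); the function name and its arguments are invented here, and replies are assumed to be single lines of ASCII text.

```python
import re
import socket


def service_is_up(host, port, exchanges, timeout=5.0):
    """Return True if every (send, expect_regex) exchange succeeds in order.

    `send` may be None to send nothing and just read a line, which suits
    protocols such as SMTP where the server speaks first.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn, \
                conn.makefile("rb") as reader:
            for send, expect in exchanges:
                if send is not None:
                    conn.sendall(send.encode("ascii") + b"\r\n")
                reply = reader.readline().decode("ascii", errors="replace")
                if not re.match(expect, reply):
                    return False  # wrong reply, or connection closed early
            return True
    except OSError:
        return False  # refused, unreachable or timed out
```

A probe of an SMTP service would then look like service_is_up(host, 25, [(None, r"220.*")]), treating the service as operational only if it opens with a 220 greeting.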


DNS Updates 

Recent versions of named have included an implementation of the dynamic DNS update
protocol defined in RFC 2136. This allows authorised hosts to remotely update the DNS databases.
The current release of named does not correctly propagate such dynamic updates unless the
update is sent to the master server for the domain in question. Future releases will remove
this restriction.

The current version of janadu only supports a small number of RR types, but does include 
A, CNAME, NS, SOA and PTR. 
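
To show what such an update amounts to, the Python sketch below builds the text of an update that repoints a CNAME, written in the command syntax accepted by BIND’s nsupdate utility. janadu itself sends the update over the wire from Java, so this is an illustration of the content of the update only; the function name is invented and nothing here is taken from janadu.

```python
def _absolute(name):
    """Append the trailing dot that marks a fully qualified DNS name."""
    return name if name.endswith(".") else name + "."


def cname_switch_script(server, zone, alias, target, ttl=60):
    """Return nsupdate commands that repoint the CNAME `alias` at `target`."""
    return "\n".join([
        f"server {server}",
        f"zone {zone}",
        f"update delete {_absolute(alias)} CNAME",  # drop the old target
        f"update add {_absolute(alias)} {ttl} CNAME {_absolute(target)}",
        "send",
    ]) + "\n"
```

The delete and the add travel in the same update message, so resolvers should not see a moment at which the alias does not exist.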


Configuration 

janadu reads a single configuration file which describes what actions it should perform. The 
following is an example configuration file. 


Monitor
{
    key         smtp.uow.edu.au
    type        cname
    ttl         60
    interval    5
    retries     2
    dns_server  draci.its.uow.edu.au
    zone        uow.edu.au

    protocol {
        expect {
            port     25        // SMTP port
            send     "null"    // send nothing
            expect   "^220.*"
            timeout  5
        }
    }

    host
    {
        name        worner.uow.edu.au
        preference  10         // preferred host
        data        { worner.uow.edu.au }
    }

    host
    {
        name        wumpus.its.uow.edu.au
        preference  20
        data        { wumpus.its.uow.edu.au }
    }
}


In the above case, 2 hosts are being monitored, worner.uow.edu.au and
wumpus.its.uow.edu.au. The CNAME smtp.uow.edu.au is the DNS record being updated
on the DNS server draci.its.uow.edu.au and the DNS zone in question is uow.edu.au.
Hosts are polled every 5 seconds and 2 attempts are made to contact a host before it is
considered non-operational. The method of probing a host is the “expect” method: a
connection is made to the SMTP port (25) and the reply is matched against the regular
expression given. If more than 1 of the hosts being monitored is responding to probes, the
one with the lowest preference is selected.
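
The selection rule just described can be sketched as follows in Python. The data structures and function names are assumptions made for this example and are not taken from janadu, which is written in Java.

```python
def record_probe(failures, name, succeeded):
    """Update the count of consecutive failed probes for one host."""
    failures[name] = 0 if succeeded else failures.get(name, 0) + 1


def choose_target(hosts, failures, retries=2):
    """Return the live host with the lowest preference, or None.

    hosts    -- list of (name, preference) pairs
    failures -- dict of consecutive failed probes per host name
    retries  -- failed attempts after which a host counts as down
    """
    alive = [(preference, name) for name, preference in hosts
             if failures.get(name, 0) < retries]
    return min(alive)[1] if alive else None
```

With the two hosts from the configuration above, worner.uow.edu.au (preference 10) is chosen until it has failed 2 consecutive probes, at which point wumpus.its.uow.edu.au (preference 20) takes over; a single successful probe of worner.uow.edu.au switches the choice back.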


Security 

Obviously, the ability to update DNS records over a network presents a potential security
problem. The current version of named (8.1.1) restricts dynamic updates to zones which have
the feature explicitly enabled and allows the DNS administrator to restrict updates to
specific hosts (IP addresses). There is also a secure version of the DNS protocol which would
offer much improved security and this will hopefully be used in a future version of janadu.


Other Utilities 

Since janadu required writing a collection of classes to handle various aspects of interaction
with DNS, it was an easy task once the core application was completed to add additional
applications to perform other useful DNS related activities. The package includes a stand
alone utility similar to dig 3 for sending a single query to a DNS server and printing the reply,
a stand alone utility for sending a DNS update to a server (this allows for batching updates if
used with a scripting language) and a utility with a GUI for sending both queries and updates
to a DNS server.

These stand alone utilities offer the prospect of a site totally administering its DNS
systems via an application interface rather than editing the named data files. This would
remove the common problem of mistakes during editing of DNS configuration files since all
updates to the files would be performed by named. The GUI utility runs either as a stand
alone Java application or as an applet under the control of a Java enabled browser such as
Sun’s HotJava or Netscape.


Futures 

The software is currently under development. The next major development will be the
implementation of the DNS security extensions to ensure the authenticity of the DNS updates.

3 Domain Information Groper





Availability 

The software is not yet available for general distribution, but is available to people wishing to act
as testers. Interested parties should email the author. The URL http://janadu.uow.edu.au/
allows remote users with Java enabled browsers to run the GUI based utility.





PARMON: A Comprehensive Cluster Monitoring System 


RAJKUMAR 

School of Computing Science 
Queensland University of Technology 
825a, PLAS Lab, S-Block 
Gardens Point Campus, Brisbane, Australia 
raikumar@fit.qut.edu.au 


KRISHNA MOHAN and BINDU GOPAL 

Operating Systems Group 
Centre for Development of Advanced Computing 
2/1, Ramanashree Plaza 
Brunton Road. Bangalore, India 
{krishmo, bindug}@cdacb.ernet.in


Abstract - Workstation clusters have of late become a cost-effective solution for high
performance computing. C-DAC’s PARAM OpenFrame is a large cluster of high performance
workstations interconnected through low-latency, high bandwidth communication networks.
Monitoring such huge systems is a tedious and challenging task since typical workstations
are designed to work as standalone systems, rather than as part of a workstation
cluster. System administrators require tools to monitor such huge systems effectively.
PARMON provides a solution to this challenging problem.

PARMON is a portable, flexible, interactive, scalable, location-transparent, and
comprehensive environment for monitoring large clusters. It follows the client-server
methodology and provides transparent access to all nodes to be monitored from a monitoring
machine. PARMON allows monitoring of critical system resource activities and their utilization
at three different levels: entire system, node, and component. It also allows monitoring of
multiple instances of the same component, such as the CPUs in an SMP node.

The two major components of PARMON are parmon-server, which provides system resource activity
and utilization information, and parmon-client, a GUI-based client responsible for
interacting with parmon-server and users, gathering data in real time, and presenting
information graphically for visualization. The PARMON client is designed, developed, and
implemented using state-of-the-art object-oriented, client-server, and Java computing
technologies. The PARMON server is developed as a multithreaded server using
POSIX/Solaris threads and C, as Java does not support interfaces to access system internals.
PARMON is being used successfully to monitor the PARAM OpenFrame supercomputer,
which is a cluster of 48 Ultra-4 workstations running the Sun Solaris operating system.

Keywords: Clusters, Single System Image, Monitoring, System Activities, and Resource 
Utilization. 

1. Motivations and Goals 

During the 1980s, computer scientists believed that the performance of a computer could be improved by
creating faster, more efficient processors. This idea is being challenged by the concept of
clustering, which essentially means linking two or more computers to perform functions together. The goal is
to develop infrastructure (both hardware and software) so that end users need not know on which
computer they are actually working. This enables multiple computers to work together as one system.
Rapid changes in both computing and communication (the availability of commodity high-performance
microprocessors and high-speed networks) facilitate this transition. These two enabling
technologies are making Networks of Workstations (NOW) or Clusters of Workstations (COW) (in
short, networks of computers, whether PCs or workstations) an appealing vehicle for parallel processing [1],
leading to low-cost commodity supercomputing.





Computer scientists have for years tackled the problem of adding more performance while
maintaining the integrity of systems. A major advancement that addresses both concerns is
clustering, or cluster computing, in which multiple computers are linked together to work as one system.

Monitoring a cluster computer is a tedious and challenging task, since a typical workstation is
designed to work as a standalone system rather than as part of a workstation cluster. The task can be eased
by software systems that allow monitoring of the entire system at different levels through an
integrated GUI display. PARMON is one such system: it allows monitoring of the entire cluster, a single
node, a component of a node, or the activities of a component. A node can have multiple instances of the
same component type, and PARMON allows each such instance to be monitored.

As stated above, high performance computing on commodity hardware is gaining wide acceptance. 
For this to be practicable, it is important that systems provide a single system image at any one (or 
more) of the following levels: 

• Hardware Level 

• Operating System Level 

• Message Passing Interfaces Level 

• Language/Compiler Level 

• Tools/Application Level 

Transparent access to a cluster computer is possible on a system exhibiting a single system
image. The above approaches for achieving unified access to system resources have been discussed
elsewhere [2]. In this paper, we discuss a cluster monitoring system, PARMON, which provides a
unified means of monitoring a cluster computer at the tools/application level.

A large network interconnects many computers from different vendors, with different machine
architectures and running different operating systems. In such a network of heterogeneous systems,
the monitoring system must be portable. This is achieved by developing PARMON using the Java
programming language.

The motivation for using Java lies in its features, as highlighted by its definition [3]: “Java: A simple,
object-oriented, distributed, interpreted, robust, secure, architecture neutral, portable, high-performance,
multi-threaded, and dynamic language”. Among these features, Java makes network
programming easier by encapsulating connection functionality in socket classes [4]. Java is mostly
used for writing distributed computing programs or GUI-based programs because of its portability,
and it goes further than most languages in offering not just portability but
identical program behavior on different platforms. Java is mainly intended for the development of
object-oriented, network-based software for Internet applications. The networking, multithreading, and
GUI-component features of Java have been used extensively in developing the PARMON client.

2. Overview of C-DAC HPCC Software Architecture 

C-DAC HPCC (High Performance Computing and Communication) software is an open and flexible
parallel and distributed processing environment for a cluster of UNIX workstations [5]. The
components of the C-DAC software architecture are shown in Figure 1. The software architecture allows
the collection of workstations to be viewed as independent workstations, a cluster of workstations, or an
MPP (Massively Parallel Processing) system connected through a scalable, low-latency, high-bandwidth
interconnection network.

The HPCC software allows users to develop and execute sequential, message passing, and data
parallel programs. An efficient MPI (Message Passing Interface) implementation provides the required
support for developing message passing applications, while support for data parallel programs is
provided through HPF (High Performance Fortran). High performance communication protocols
(Light-Weight Protocols) and a rich set of program development, system management, and software
engineering tools are aimed at offering high performance services and usability.

C-DAC HPCC software is a complete solution for enterprises that need to create and execute
parallel programs on UNIX clusters. Performance for both parallel and distributed applications is
guaranteed through an efficient implementation of lightweight communication protocols. In addition, the
environment supports effective monitoring and administration of the multiple resources of large
UNIX clusters.


[Figure 1 diagram. Layers recoverable from the scan, top to bottom: Applications; Parallel
Development Tools (F90 IDE, DIViA), Parallel File System (C-PFS), and Languages (C, F77, F90);
System Management Tools and Message Passing Interfaces (C-MPI, PVM); Light Weight Protocols;
Solaris; Cluster Hardware.]




Figure 1: C-DAC HPCC Software Architecture 


The components of the C-DAC HPCC system software are, in brief:

PULP: Lightweight communication software. Layered over ParamNet and Myrinet for UNIX
clusters.

C-MPI: Optimized implementation of MPI for Cluster of Multi Processors (CLUMPS). Both point- 
to-point and collective calls have been optimized. Effectively uses both shared and distributed 
memory of CLUMPS. 

C-PFS: Parallel File System. Provides MPI-IO file system interface to parallel applications. 

CAF90: Fortran 90 compiler. Jointly developed by C-DAC and APOGEE, USA. Integrated 
development environment with debugging, profiling, browsing and project management support. 

C-F77to90: A tool for converting FORTRAN-77 to Fortran 90 programs. An essential product for 
programmers who have large FORTRAN 77 code to maintain. 

DIViA: Parallel program correctness and performance debugger. Detects communication bottlenecks 
and supports message debugging. Provides architecture and communication layer neutrality. 

PARMON: Cluster monitoring tool. Monitors the cluster as a unified resource. Helps in detecting 
performance bottlenecks and also acts as an interface to diagnostic tools. 

PARCOM: Parallel Unix Commands. Provides parallel extensions to traditional UNIX commands. 

MetricAdvisor: Software engineering tool for metrics. Evaluates Halstead, McCabe, Complexity
Density, and Fan-in and Fan-out metrics.

A detailed discussion of the above components other than PARMON is beyond the scope of this
paper; the reader is referred to the C-DAC website, http://cdacb.ernet.in, or to [5]
for details. The remainder of this paper discusses the PARMON system model and architecture,
the monitoring of system resources and activities with PARMON, the implementation of PARMON, related
work, and future directions.







3. PARMON Overview 

PARMON allows the user to monitor system activities and resource utilization of various components
of workstation clusters. It monitors the machine at various levels (component, node, and entire
system), exhibiting a single system image. PARMON allows monitoring of the following:

• Aggregate of System Resources Utilization 

• Process Activities 

• System Log Activities 

• Kernel Activities 

• Multiple instances of the same resource 

PARMON allows events to be defined and triggers them automatically whenever the event condition is met
at run time. It also provides physical and logical views of the components of the system.

The important features of PARMON include the following: 

• Exploits the latest developments in hardware and software technology for
communication and imaging.

• Allows the user to build a system database (nodes and groups) comprising node names,
communication interfaces, and the specific area of a disk to be monitored.

• Supports listing of system information and machine configuration of all nodes of a cluster. 

• Allows instrumentation of system resources such as CPU, disk, memory, and network and
their parameters at both macro and micro level.

• Supports monitoring of the cluster at node level, group level, or entire system level and thus
exhibits a single system image.

• Allows the parallel execution of selected operations on a single workstation or a group of workstations,
real-time and interactive resource monitoring (e.g., processor, memory, disk, and network
utilization), and normal maintenance tasks such as node or cluster shutdown.

• The PARMON client is portable to all platforms supporting the Java runtime system, the JVM (Java
Virtual Machine), and the PARMON server is portable across all machines running Solaris.

4. PARMON System Model: Architecture 

PARMON consists of the parmon-client and the parmon-server (parmond). The system model,
shown in Figure 2, follows the client-server paradigm, with the nodes to be monitored acting as servers
and the monitoring system acting as a client. The cluster nodes can be monitored from any workstation,
PC, or node of the cluster itself.



[Figure 2 diagram label: Cluster of Workstations (COW/NOW)]

Figure 2: PARMON Cluster Monitoring System Model 










The PARMON server is loaded on all the nodes that need to be monitored. A client requests
parameters through message passing and receives the server’s response in the form of messages. The client
interprets the messages and converts them into an appropriate format for graphical visualization.

A client can either monitor all the nodes or selectively monitor a few nodes of the cluster. For
effective monitoring, the concept of a group is supported. A set of nodes forms a group, and nodes are
selected based on the allocation of resources to various user groups. This grouping mechanism helps
in monitoring and gathering usage statistics, with which the system administrator can change the
resource allocation strategy.

5. Monitoring System Activities and Resource Utilization 

This section discusses the monitoring of system activities and resource utilization using
PARMON. This knowledge helps in understanding the techniques used in the design, development, and
implementation of PARMON. The user interacts with PARMON through the PARMON Launcher, shown
in Figure 3, to monitor the utilization of system resources such as memory and disk and system activities
such as the execution of programs [6]. The launcher also allows software instrumentation of resource activity
parameters through the kernel-data-catalog option.



Figure 3: PARMON Launcher 


5.1 Aggregation in Visualization of Resource Utilization 

The use of aggregation allows the visualization to scale to the whole cluster. The
administrator can create user groups, each containing a set of nodes chosen according to the resource
allocation policy, and monitor them using PARMON as shown in Figure 4. The same statistic from different
nodes is combined into a single statistic; this technique is called group/machine utilization. The
results shown in Figure 4 can also be presented as pie charts or bar graphs in PARMON.
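
The combining step can be illustrated with a minimal Java sketch. The paper does not state how the per-node figures are combined, so the arithmetic mean used here is an assumption, and the class and method names are hypothetical rather than taken from PARMON.

```java
import java.util.List;

/** Hypothetical helper: combines per-node readings into one group statistic. */
final class GroupUtilization {

    private GroupUtilization() { }

    /**
     * Returns the arithmetic mean of the per-node utilization percentages.
     * Averaging is an assumption; the paper only says the values are combined.
     */
    static double combine(List<Double> nodePercentages) {
        if (nodePercentages.isEmpty()) {
            throw new IllegalArgumentException("a group needs at least one node");
        }
        double sum = 0.0;
        for (double percentage : nodePercentages) {
            sum += percentage;
        }
        return sum / nodePercentages.size();
    }
}
```

Under this assumption, a group whose four nodes report 10%, 20%, 12%, and 14% CPU usage would be shown with a group CPU usage of 14%. A weighted scheme (for example by disk capacity) would be an equally plausible choice for some resources.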




























[Figure 4 screenshot. Menus: File, TimeSet. Values as far as they can be read from the scan:]

Group Name          Num Of Nodes   CPU Usage   Disk Usage   Mem Usage
Krish_Group         4              14%         98%          84%
Hardware            3              12%         90%          84%
HPCC_Group          4              12%         98%          84%
SSDG_Group          (illegible)    11%         98%          (illegible)
Networking_Group    3              12%         98%          84%
(unlabeled row)                    12%         98%          84%


Figure 4: Resource Utilization by Groups in a Cluster 


5.2 Process Activities 

The utilization of the CPU resource can be measured by monitoring process activities, which helps in
identifying CPU- or memory-intensive processes. The parameters that can be monitored using PARMON
are: pid, command, uname, uid, nice, status, user (%), sys (%), total (%), total CPU (%), and start-up
time. PARMON continuously updates the process and system data at a user-specified sampling
interval. A snapshot of process activities is shown in Figure 5; this information can be ordered
by a selected parameter such as process ID or user name. The information displayed
is similar to that of the top utility.
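
Ordering a process snapshot by a selected parameter can be sketched in Java as follows. The record fields mirror a few of the parameters listed above, but the class names, the key strings, and the choice to show the busiest process first are assumptions made for illustration, not PARMON's actual code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical record with a few of the per-process fields listed above. */
final class ProcessInfo {
    final int pid;
    final String command;
    final String uname;
    final double totalCpuPercent;

    ProcessInfo(int pid, String command, String uname, double totalCpuPercent) {
        this.pid = pid;
        this.command = command;
        this.uname = uname;
        this.totalCpuPercent = totalCpuPercent;
    }
}

/** Hypothetical helper that orders a process snapshot by a chosen column. */
final class ProcessTable {

    private ProcessTable() { }

    /** Returns a sorted copy of rows; the key strings are illustrative only. */
    static List<ProcessInfo> sortedBy(List<ProcessInfo> rows, String key) {
        Comparator<ProcessInfo> order;
        switch (key) {
            case "pid":
                order = Comparator.comparingInt((ProcessInfo p) -> p.pid);
                break;
            case "command":
                order = Comparator.comparing((ProcessInfo p) -> p.command);
                break;
            case "uname":
                order = Comparator.comparing((ProcessInfo p) -> p.uname);
                break;
            case "cpu":
                // Highest CPU consumer first, as in top-like displays.
                order = Comparator.comparingDouble((ProcessInfo p) -> p.totalCpuPercent).reversed();
                break;
            default:
                throw new IllegalArgumentException("unknown sort key: " + key);
        }
        List<ProcessInfo> copy = new ArrayList<>(rows);
        copy.sort(order);
        return copy;
    }
}
```

Returning a sorted copy leaves the sampled snapshot untouched, so the same data can be redisplayed in a different order without asking the server again.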


[Figure 5 screenshot: the window “Processes on PARAM Open Frame”, with File, Sort, and TimeSet
menus. The Sort menu offers By Process Name, By Number, By User Name, By UID, and By Size.
Separate tables list the processes on the nodes ganga, kaveri, and cdacb at 18-Mar-98 8:06:33 PM
under the columns PROCESS NAME, NUMBER, STATE, CPU T (msec), MEM, UID, USERNAME, and PROCESSOR.]


Figure 5: Information on Resources Consumed by Processes 















5.3 System Logs 

PARMON helps in the effective monitoring of system logs maintained by the operating system. It allows
system messages and syslog files to be searched for entries that occurred at a specific time or that
contain a specific keyword or word pattern.
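
Searching log entries for a keyword or word pattern can be sketched with a regular expression filter. This is a minimal illustration in Java; the class and method names are hypothetical, and the paper does not say how PARMON's server performs the search.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical filter over log lines that have already been read into memory. */
final class SyslogFilter {

    private SyslogFilter() { }

    /** Keeps the lines in which the regular expression matches anywhere. */
    static List<String> matching(List<String> lines, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> hits = new ArrayList<>();
        for (String line : lines) {
            if (pattern.matcher(line).find()) {
                hits.add(line);
            }
        }
        return hits;
    }
}
```

Because syslog entries conventionally begin with a timestamp, the same filter covers the time-based case: a pattern anchored at the start of the line, such as `^Mar 18 20:06`, selects the entries logged in that minute.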

5.4 Kernel Activities 

PARMON supports software instrumentation of system resources such as CPU, memory, disk, and
network and their activities. When a particular resource has more than one instance, PARMON
allows monitoring of each resource instance individually. Invoking the kernel-data-catalog option
allows instrumentation of kernel activities related to resources such as CPU, memory, disk, and
network, as discussed below.

CPU Parameters 

Monitoring CPU parameters helps in understanding how the CPU is being utilized. PARMON
allows monitoring of the number of mutex operations, interrupts, context switches, system calls (Figure 6), forks,
execs, page-ins, page-outs, swap-ins, and swap-outs performed per second and displays these
activities as graphs. It also allows monitoring of the process run queue, I/O queue, and swap queue.
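
The paper does not describe how the per-second figures are derived. One common approach, assumed here, is to sample a cumulative counter twice and divide the difference by the elapsed time. The Java sketch below shows only that arithmetic; the class and method names are hypothetical.

```java
/** Hypothetical helper: turns two samples of a cumulative counter into a rate. */
final class CounterRate {

    private CounterRate() { }

    /** rate = (laterCount - earlierCount) / elapsed seconds */
    static double perSecond(long earlierCount, long laterCount, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsed time must be positive");
        }
        return (laterCount - earlierCount) * 1000.0 / elapsedMillis;
    }
}
```

For example, a system call counter that rises from 1000 to 1500 over a 5 second sampling interval corresponds to 100 system calls per second.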



Figure 6: System Calls Generated by Processes Running on a Node in a Cluster

Memory Parameters

PARMON allows continuous instrumentation of memory availability, memory in-use, free memory, 
percentage of memory in-use, reserved swap space, allocated swap space, and available swap space. 

Disk Parameters 

PARMON allows monitoring of disk operations such as reads and writes, the number of jobs waiting in the
queue for disk service, and the disk’s run time and wait time.



















Network Parameters 

The software instrumentation of network parameters such as input packets, output packets, and errors in
packet transmission helps in detecting network bottlenecks. PARMON can also display the
percentage of incoming and outgoing data packets containing packet format errors.

5.5 Component View: Physical and Logical 

PARMON can display a picture of the system and of a few important components, which helps the
user quickly understand the machine’s physical appearance. It also displays different views of
the machine.

The Logical View displays system components in a hierarchical manner. System
components include processing elements, file systems, and network components, which are probed
dynamically. Each item in the hierarchy diagram can be probed further for more detailed information.
For instance, when the file system (fs) is probed, it shows all logical partitions of the disk and other
details such as partition name, allocated disk space, and disk usage in terms of both bytes and
percentage.

5.6 Device Control 

PARMON also supports control of multiple instances of the same resource, such as the multiple CPUs in
an SMP node, where it allows a CPU to be set on-line or off-line.

5.7 Events Generation 

PARMON allows events to be defined, such as sending e-mail when a user crosses resource utilization
limits. This helps the administrator maintain effective control over system resource utilization and
adjust resource allocation policies.
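
An event of this kind amounts to a condition paired with an action. The following Java sketch shows one way to express it; the class name, the strictly-greater-than comparison, and the use of a callback in place of real e-mail delivery are all assumptions made for illustration.

```java
import java.util.function.Consumer;

/** Hypothetical event: run an action when a utilization reading crosses a limit. */
final class UtilizationEvent {
    private final String resource;
    private final double limitPercent;
    private final Consumer<String> action; // stand-in for sending e-mail

    UtilizationEvent(String resource, double limitPercent, Consumer<String> action) {
        this.resource = resource;
        this.limitPercent = limitPercent;
        this.action = action;
    }

    /** Runs the action if the reading exceeds the limit; reports whether it fired. */
    boolean check(String node, double readingPercent) {
        if (readingPercent <= limitPercent) {
            return false;
        }
        action.accept(resource + " utilization on " + node + " is " + readingPercent
                + "%, above the limit of " + limitPercent + "%");
        return true;
    }
}
```

A monitoring loop would call check() with each new sample; passing the notification in as a callback keeps the condition logic independent of how the administrator is actually notified.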

5.8 Diagnostics 

Under Solaris, PARMON integrates SunVTS to perform validation, testing, and stress testing on external
devices such as disk and tape drives and on network connectivity. SunVTS helps in locating problems
with components that have managed to pass the power-on self test (POST), as well as
problems with external devices that POST knows nothing about.

5.9 Data Representation 

PARMON uses pie charts, bar charts, and line graphs to represent the usage of resources such as disk and
memory and to plot the various kernel activity parameters of CPU, disk, network, etc.

5.10 On-line Help 

PARMON offers comprehensive online help. The user can choose a function in the main help menu
and then navigate to specific details of how each service can be accessed and interpreted. The online
help also has a glossary of technical terms and parameters that are often used in PARMON or in the
cluster computing community.

5.11 Miscellaneous Features 

The additional features supported by PARMON include the following:

• Issuing a specific command on selected nodes or all nodes in a cluster.

• Listing the users working on selected nodes or the entire cluster.

• Broadcasting a message to selected or all nodes in a cluster.

• Retrieving system information and configuration.

• Retrieving information about the packages installed on nodes.





6. PARMON Implementation 

The two major components of PARMON are parmon-server, which provides system resource activity and
utilization information, and parmon-client, a GUI-based client responsible for interacting with
parmon-server and users, gathering data in real time, and presenting information graphically for
visualization. The PARMON client is designed, developed, and implemented using state-of-the-art
object-oriented, client-server, and Java computing technologies. The PARMON server is
developed as a multithreaded server using POSIX/Solaris threads and C, as Java does not support
interfaces to access system internals. The PARMON client also maintains information about the cluster
setup, groups, and node locations. This information must be created by the PARMON user so that the client
knows which nodes form a cluster or a group.

6.1 Server Implementation 

A multithreaded implementation allows a server to serve the requests of multiple clients
simultaneously, with improved response time and throughput. The server creates a new thread for
every request it receives and assigns responsibility for serving the request to this new thread, so that
the main server process can handle other requests waiting in the queue or wait for new requests.

The PARMON server is a multithreaded server. It communicates with clients over parmon-port (one of
the free ports reserved for parmon) and provides defined services as discussed below.

6.1.1 PARMON Server Socket 

The PARMON server opens a connection endpoint by invoking socket() and binds it to a specified port,
the parmon-port, through bind(). It marks the endpoint as ready to accept connections by invoking
listen() and then blocks in accept() until a connection request from a client arrives; accept() returns
the connection over which the server and the client communicate. Once the connection is established,
the server creates a service-thread, which is responsible for serving the client request, and then goes
on to serve other requests.
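
The accept-and-dispatch pattern described in this subsection can be sketched as follows. The PARMON server itself is written in C; this Java analogue is purely illustrative, with hypothetical class and method names and a placeholder echo service instead of real monitoring requests. In Java, the ServerSocket constructor carries out the socket(), bind(), and listen() steps in one call.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

/** Hypothetical thread-per-connection server; not PARMON's C implementation. */
final class ThreadPerConnectionServer implements AutoCloseable {
    private final ServerSocket listener;

    private ThreadPerConnectionServer(ServerSocket listener) {
        this.listener = listener;
    }

    /** Binds to the loopback address; port 0 lets the OS pick a free port. */
    static ThreadPerConnectionServer start(int port) {
        try {
            // ServerSocket performs the socket(), bind() and listen() steps.
            ServerSocket listener = new ServerSocket(port, 50, InetAddress.getLoopbackAddress());
            ThreadPerConnectionServer server = new ThreadPerConnectionServer(listener);
            Thread acceptor = new Thread(server::acceptLoop, "acceptor");
            acceptor.setDaemon(true);
            acceptor.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    int port() {
        return listener.getLocalPort();
    }

    /** Blocks in accept() and hands each connection to a new service thread. */
    private void acceptLoop() {
        while (!listener.isClosed()) {
            try {
                Socket connection = listener.accept();
                Thread service = new Thread(() -> serve(connection), "service-thread");
                service.setDaemon(true);
                service.start();
            } catch (IOException e) {
                return; // the listener was closed or failed; stop accepting
            }
        }
    }

    /** Placeholder service: answers one request line with "ECHO <line>". */
    private static void serve(Socket connection) {
        try (Socket c = connection;
             BufferedReader in = new BufferedReader(new InputStreamReader(c.getInputStream()));
             PrintWriter out = new PrintWriter(c.getOutputStream(), true)) {
            out.println("ECHO " + in.readLine());
        } catch (IOException e) {
            // Nothing to recover; try-with-resources releases the socket.
        }
    }

    /** Client-side helper: sends one request line to a server on loopback. */
    static String request(int port, String line) {
        try (Socket s = new Socket(InetAddress.getLoopbackAddress(), port);
             BufferedReader in = new BufferedReader(new InputStreamReader(s.getInputStream()));
             PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
            s.setSoTimeout(5000); // avoid waiting forever for a reply
            out.println(line);
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            listener.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Starting the server with start(0) and calling request(server.port(), "cpu busy") exercises one round trip. Because the accepting thread only dispatches, a slow request does not delay the acceptance of new connections, which is the property the text attributes to the multithreaded design.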

6.1.2 Multithreading PARMON Server 

The PARMON server is multithreaded using both the POSIX and the Solaris threads interfaces to enhance its
portability; conditional compilation selects one of the two when the executable is built.
The PARMON server and clients communicate with each other by exchanging messages.
A client requests parameters through messages, and the server creates a thread, called a service-thread,
for each request by calling thr_create() for Solaris threads or pthread_create() for POSIX
threads. The service-thread interprets the request, processes it on the local node, and
communicates the results to the client. The service-thread and the client may communicate with each
other more than once in order to complete the request. After providing the requested service, the
service-thread releases the resources (such as buffers and sockets) that it held and terminates.

6.1.3 PARMON Services and their Implementation 

The services provided by the server fall into two main categories: kernel-related
services, and services related to system status and configuration. The kernel-related
services focus on the activities of the CPU, disk, memory, and network during process execution. The
second type of service includes information about process activities,
system logs, users, physical and logical views, system information, and system configuration. The
CPU parameters include busy(%), idle(%), sys(%), user(%), wait(%), context_switches, interrupts,
system calls, forks, execs, pgin, pgout, run_queue_length, io_queue_length, etc. The disk parameters
include operations such as reads and writes, disk wait percentage, disk active percentage, disk service
percentage, etc. The memory parameters consist of available memory, free memory, used memory,
swap avail, swap free, swap reserved, etc. Input packets, output packets, input errors, output errors,
collisions, error percentage, and defer percentage are some of the network parameters.





Kernel Parameters 

In Solaris, the kstat (kernel statistics) facility is a general-purpose mechanism for providing kernel
statistics to users. The kernel maintains a linked list of kstats. Each kstat has a common header
section and a type-specific data section, and each can be characterized as belonging to some broad class
of statistics, for instance disk, net, or vm. This class field can be used as a filter to extract related kstats.
Solaris offers a set of interface functions to access kernel statistics. kstat_open() initializes the
kstat control structure, which provides access to the kernel statistics library. A call to kstat_read()
gets data from the kernel for a kstat, and kstat_close() frees all the resources that were
associated with the pointer returned by kstat_open().

Physical memory parameters can be obtained through sysconf(). Swap memory information
is obtained through calls to the kernel VM routines: kvm_open() initializes a set of file descriptors
to be used in subsequent calls to these routines, kvm_read() transfers data from the kernel image
specified by the pointer returned by kvm_open(), and kvm_close() closes all the file descriptors
associated with the pointer returned by kvm_open().

System Status and Configuration 

PARMON supports access to parameters related to system configuration, users, and processes;
these are implemented using the relevant system calls [7]. getutxent() gives access to
information about system users. readdir_r() gives access to information about
processes. sysconf() gives access to system configuration values such as the maximum number of files
that can be open simultaneously and the size of a memory page. sysinfo() gives
access to system information such as host name, release number, system architecture, and vendor details.

6.2 PARMON Client Implementation in Java 

The PARMON client is a GUI-based tool that interacts with users and the server. Communication
between the PARMON server and client takes place by sending and receiving messages over TCP/IP
sockets. The following interaction takes place between the PARMON user, client, and server
for every action of the user:

1. The user interacts with the PARMON client for monitoring and selects the appropriate option.

2. The PARMON client interprets the user request and invokes the appropriate event handler, or closes
the current window for an exit or close-window request.

3. The event handler performs the appropriate function. In most cases, it creates an object of a
user-defined service class that extends the Frame class and is able to interact with
the server and the user.

4. The user selects a set of nodes to be monitored.

5. The PARMON client maps each node name to its IP address by accessing the node database and establishes a
connection with the server by creating an object of the Socket class, supplying the IP address
and the parmon port number (default or user supplied). It also creates input and output streams
for this socket object.

6. The PARMON client converts the user request into messages and sends them to the server.

7. The PARMON server processes the client requests and communicates the results to the client.

8. The PARMON client interprets the messages and converts them into an appropriate format, presenting
the result as a graph, pie chart, bar graph, table, or text box.

9. The PARMON client then closes all connections made to servers.

The user class ParmonSocket is one of the most commonly used classes. It is responsible for
mapping node names to their IP addresses by accessing the node database, establishing the socket
connection to the server, and creating input and output streams on the socket. This class has methods
for reading from the server and writing commands to the server through the socket. The constructor of
the class ParmonSocket has the following syntax:

public ParmonSocket( String nodes[], int port_id) 

The ParmonSocket constructor takes two arguments, nodes and port_id. The argument nodes is
an array of String objects containing the names of the servers to which the client may request
a socket connection. The argument port_id is an integer giving the port
number on which clients request a connection to the server.

The PARMON client needs to interpret the server response to its request. The server
response contains the following two items:

| Request_Status | Data/Error |

The Request_Status can be SUCC, STOP, or CONT, and the client needs to interpret it as follows:

SUCC: The server honors the client request, and the next item, Data, contains the result.

STOP: The server could not serve the request from the client, so the client should stop sending further
requests through the current connection. The second item, Error, contains the error message and the
reason for this error.

CONT: The current request cannot be served, but the remaining requests from the client can continue.
The second item, Error, contains the error message and the reason for this error.

Processing is simplified by the class ParmonSocket, which supports the methods read() and
response(). After sending a request, the client invokes read(int i), which reads the message from node i and
returns the Request_Status. It also separates the status from the response portion, which can be
accessed through the method response().
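
The interpretation of the three status codes can be sketched in Java as below. The paper does not give the wire format, so this sketch assumes that a reply is a single line whose first space-delimited token is the status; the ServerReply class and its methods are hypothetical and are not part of ParmonSocket.

```java
/** Hypothetical parser for a reply of the form "Request_Status Data/Error". */
final class ServerReply {
    enum Status { SUCC, STOP, CONT }

    final Status status;
    final String payload; // Data for SUCC, the error text for STOP and CONT

    private ServerReply(Status status, String payload) {
        this.status = status;
        this.payload = payload;
    }

    /** Assumes the status code is the first space-delimited token. */
    static ServerReply parse(String message) {
        String trimmed = message.trim();
        int split = trimmed.indexOf(' ');
        String code = split < 0 ? trimmed : trimmed.substring(0, split);
        String rest = split < 0 ? "" : trimmed.substring(split + 1).trim();
        // valueOf throws IllegalArgumentException for an unknown status code.
        return new ServerReply(Status.valueOf(code), rest);
    }

    /** STOP and CONT both carry an error message instead of data. */
    boolean isError() {
        return status != Status.SUCC;
    }

    /** Only STOP tells the client to abandon the current connection. */
    boolean connectionUsable() {
        return status != Status.STOP;
    }
}
```

A client loop built on this would display payload as data when isError() is false, report it as an error otherwise, and stop issuing requests on the connection as soon as connectionUsable() returns false.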

Kernel Data Catalog window in Java 

The Kernel Data Catalog window shown in Figure 7 allows the user to select the node and the type of
parameter to be instrumented. This class extends the class java.awt.Frame, and its layout is set to null by
setLayout(null) so that the position of components can be controlled by the programmer. The
window's resizable property is set to false by setResizable(false).



[Figure 7 screenshot: a “Monitor CPU” window for the node yamuna, with parameter buttons
including busy (%), idle (%), ncpu (CPUs), and pgout.]




Figure 7: Kernel Data Catalog 




















Two panels are created and added to this window. The left panel shows parameters related to the 
resources, which the user selects for viewing the activities. This panel has a choice component whose 
items are CPU, disk, network and Memory and another panel contains buttons indicating parameters 
related to the resource selected in the choice component. 

The right panel shows the list of groups and the nodes in each of them. Exactly one of these nodes
is selected at a time for monitoring; the resource activity to monitor is chosen from the left panel.
The right panel has a choice component and a sub-panel with radio buttons. The choice component
has the group names as its items; the group information is obtained from the group database. The
radio buttons represent nodes, so only one node can be selected for monitoring. When a group is
selected in the choice component, the nodes belonging to that group are displayed in the
radio-button panel.

When the user clicks on any parameter, the event handler is invoked with the event details. The
handler determines the type of parameter to be monitored and the name of the selected node, and
then invokes the appropriate method (or creates the appropriate object) to instrument the selected
parameter on the selected node.
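
The dispatch step described above can be sketched without any GUI code. This is an illustration under stated assumptions, not PARMON's handler: the class ParameterDispatchSketch, its method names and the returned description strings are hypothetical, and a real monitor would open a window rather than return text.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of mapping a clicked parameter to a monitor for the selected node.
class ParameterDispatchSketch {
    // Parameter label -> action that starts a monitor on the given node.
    private final Map<String, Function<String, String>> monitors = new HashMap<>();
    private String selectedNode = "";

    void register(String parameter, Function<String, String> startMonitor) {
        monitors.put(parameter, startMonitor);
    }

    // Called when a radio button in the node panel is selected.
    void selectNode(String node) {
        selectedNode = node;
    }

    // Called when a parameter button is clicked; returns a description of what was started.
    String onParameterClicked(String parameter) {
        Function<String, String> startMonitor = monitors.get(parameter);
        if (startMonitor == null) {
            return "unknown parameter: " + parameter;
        }
        if (selectedNode.isEmpty()) {
            return "no node selected";
        }
        return startMonitor.apply(selectedNode);
    }

    public static void main(String[] args) {
        ParameterDispatchSketch dispatch = new ParameterDispatchSketch();
        dispatch.register("busy (%)", node -> "monitoring CPU busy (%) on " + node);
        dispatch.selectNode("yamuna");
        System.out.println(dispatch.onParameterClicked("busy (%)"));
    }
}
```

In an AWT program, onParameterClicked would typically be called from the button's action listener with the button label as its argument.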

6.3 Node and Group Database 

The PARMON client maintains node and group information in the files Nodes.db and Groups.db.
The file Nodes.db holds the information about the nodes in the system that can be monitored. The
user interface for creating the node database is shown in Figure 8. It takes the name of the server
and its IP address. On the right-hand side there is a list of nodes. To remove a node from this list,
select its name, click the Remove Node button, and then click the Ok button to write the change
to the Nodes.db file.
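
The node database operations above can be sketched as follows. The on-disk layout of Nodes.db is not given in this paper, so the one-node-per-line format (name, IP address and user disk area separated by whitespace) is an assumption, as are the class and method names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of an in-memory node list with an assumed text format.
class NodeDatabaseSketch {
    static final class Node {
        final String name;
        final String ipAddress;
        final String diskArea;

        Node(String name, String ipAddress, String diskArea) {
            this.name = name;
            this.ipAddress = ipAddress;
            this.diskArea = diskArea;
        }

        String toLine() {
            return name + " " + ipAddress + " " + diskArea;
        }
    }

    private final List<Node> nodes = new ArrayList<>();

    // Parses one assumed record: "<name> <ip address> <user disk area>".
    static Node parseLine(String line) {
        String[] fields = line.trim().split("\\s+");
        if (fields.length != 3) {
            throw new IllegalArgumentException("expected 3 fields: " + line);
        }
        return new Node(fields[0], fields[1], fields[2]);
    }

    void add(Node node) {
        nodes.add(node);
    }

    // Mirrors the Remove Node button: returns true if a node with that name was removed.
    boolean remove(String name) {
        return nodes.removeIf(node -> node.name.equals(name));
    }

    // Mirrors the Ok button: the text that would be written back to the file.
    String toFileContents() {
        List<String> lines = new ArrayList<>();
        for (Node node : nodes) {
            lines.add(node.toLine());
        }
        return String.join("\n", lines);
    }

    public static void main(String[] args) {
        NodeDatabaseSketch db = new NodeDatabaseSketch();
        db.add(parseLine("yamuna 202.141.63.191 /export/home"));
        db.add(parseLine("ganga 202.141.63.190 /export/home"));
        db.remove("yamuna");
        System.out.println(db.toFileContents());
    }
}
```

Keeping the edits in memory and writing the file only on Ok matches the two-step removal described in the text.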



Figure 8: Node Setup window (screenshot; input fields Node Name, IP Address and User Disk Area,
and a Nodes Box listing each node's name, IP address and disk area, for example
yamuna 202.141.63.191 /export/home)


PARMON allows the creation of user groups, each containing a set of nodes, based on resource
allocation and monitoring needs. The Group Create window allows the user to select the set of nodes
that forms a user group. The group details (group name and its nodes) are maintained in the Groups.db
file. The Group Modify window offers the flexibility to change the configuration of a group by
selecting new nodes or deselecting nodes from the list of nodes.
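
A group database of this kind can be sketched as a map from group name to an ordered set of node names. The class name, the group name "rivers" and the one-line-per-group text format below are assumptions for illustration; the actual Groups.db layout is not specified in this paper.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of group creation and modification.
class GroupDatabaseSketch {
    private final Map<String, Set<String>> groups = new LinkedHashMap<>();

    // Group Create: the selected nodes form the new group.
    void createGroup(String group, List<String> nodes) {
        groups.put(group, new LinkedHashSet<>(nodes));
    }

    // Group Modify: select a new node for an existing group.
    void selectNode(String group, String node) {
        membersOf(group).add(node);
    }

    // Group Modify: deselect a node from an existing group.
    void deselectNode(String group, String node) {
        membersOf(group).remove(node);
    }

    // The nodes to show as radio buttons when the group is chosen.
    List<String> nodesOf(String group) {
        return new ArrayList<>(membersOf(group));
    }

    private Set<String> membersOf(String group) {
        Set<String> members = groups.get(group);
        if (members == null) {
            throw new IllegalArgumentException("unknown group: " + group);
        }
        return members;
    }

    // Assumed format: one line per group, "<group>: <node> <node> ...".
    String toFileContents() {
        StringBuilder out = new StringBuilder();
        for (Map.Entry<String, Set<String>> entry : groups.entrySet()) {
            out.append(entry.getKey()).append(": ")
               .append(String.join(" ", entry.getValue())).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        GroupDatabaseSketch db = new GroupDatabaseSketch();
        List<String> initial = new ArrayList<>();
        initial.add("yamuna");
        initial.add("ganga");
        db.createGroup("rivers", initial);
        db.selectNode("rivers", "kaveri");
        db.deselectNode("rivers", "ganga");
        System.out.print(db.toFileContents());
    }
}
```

An insertion-ordered set keeps the nodes in the order the user selected them and prevents the same node from being added to a group twice.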


















7. Related Works 

Many projects are investigating system administration of workstations and clusters that support
parallel computing across networks of workstations, including NOW (Network of Workstations) at
UC Berkeley, SMILE (Scalable Multicomputer Implementation using Low-cost Equipment)
at Kasetsart University, Bangkok, and Solstice™ SyMON™ at Sun Microsystems. The NOW system
administration tool gathers and stores data in a relational database. It uses a Java applet as its primary
interface, allowing users anywhere in the world to monitor the system from their browser [8]. The SMILE
administration tool is called K-CAP. Its environment consists of compute nodes, which execute the
compute-intensive tasks; a management node, which acts as file server and cluster manager; and a
management console, which controls and monitors the cluster [9]. The K-CAP user interface uses the
WWW and Java applets to connect to the K-CAP management node through a predefined URL address
in the cluster. SyMON monitors a standalone workstation, but uses client-server technology to separate
the node being monitored from the monitoring station [10].

The NSR (Node Status Reporter) provides a standard mechanism for measurement and access to 
status information in clusters of workstations [11]. Parallel applications/tools can access NSR through 
its Tool/NSR Interface. 

Unlike the NOW and SMILE cluster monitoring systems, the PARMON user interface is developed as a
Java application instead of an applet. PARMON also allows the user to monitor the cluster system from
anywhere in the world, provided the nodes of the cluster are accessible through the Internet. Global
parallel clusters built using Internet servers can easily be monitored using PARMON.

8. Conclusions and Future Directions 

PARMON is being used successfully to monitor the PARAM OpenFrame Supercomputer, which is a
cluster of 48 Ultra-4 workstations running the Sun Solaris operating system. It is designed to scale
across clusters of hundreds of workstations, including Internet-based global parallel clusters.

PARMON introduces some overhead on the network, as its clients and server communicate with each
other by exchanging messages. The message size and the amount of time required to process a request
depend on the request type. This network overhead cannot be avoided entirely, because PARMON
uses the client-server paradigm.

Our experience in building PARMON shows that Java offers a flexible environment for building
scalable and extensible applications. PARMON can be easily extended to support new features, as it
has been designed using the latest object-oriented and Internet-based Java technology. The PARMON
client is fully portable because it is written in Java, while the server needs a few changes to support
new platforms.

PARMON can easily be extended to support a web-based user interface for monitoring. This can be
achieved by converting the PARMON client (which is a Java application) into an applet, and building an
intermediate server, running on the web server, that acts as a link between the PARMON servers
running on the cluster nodes and the web-based user interface containing the applet. Security is another
important issue that PARMON needs to address when web-based monitoring support is provided.

Good cluster management systems are crucial in exploiting computer clusters as a high
performance computing platform. Future high performance computing systems are expected to
be cluster based and heterogeneous in nature. In keeping with this trend, we are porting PARMON
to Linux. We are also planning to port PARMON to Windows NT, as NT-based clusters are becoming
popular. The availability of PARMON for Unix, Linux, and NT-based clusters will make it a truly
heterogeneous cluster monitoring system.

Acknowledgments 

The authors are grateful to all the members of C-DAC, Bangalore. In particular, we thank Mohan
Ram, Arun Babu, Nitin, Raghavendra, Vikas, and Soma Sundaram for their support, suggestions,
comments, and advice during the design and development of PARMON. We thank Smrithi R and
Mudlapur R, and Binu Thomas of Queensland University of Technology, for proofreading drafts of
this paper.

References 

1. T. Anderson, D. Culler, D. Patterson, and the NOW team. A Case for NOW (Networks of 
Workstations). IEEE Micro, pages 54-64, February 1995. 

2. Rajkumar. Single System Image: Need, Approaches, and Supporting HPC Systems. Proceedings 
of the Fourth International Conference on Parallel and Distributed Processing, Techniques and 
Applications (PDPTA'97), CSREA Publishers, Las Vegas, USA, 1997. 

3. Sun Microsystems. Java Language - A White Paper. Sun Microsystems Computer Company,

4. Patrick Naughton and Herbert Schildt. JAVA: The Complete Reference. McGraw-Hill Inc., 1997.

5. Mohan Ram N, Sasi Kumar S, Vijay P Bhatkar, and Arora R K. HPCC Software: A Scalable 
Parallel Programming Environment for UNIX Clusters. 1998. 

6. HPCC System Software PARMON Group. PARMON User Manual. © C-DAC, 1998. 

7. Berny Goodheart and James Cox. The Magic Garden Explained: The Internals of UNIX System
V Release 4. Prentice Hall, 1993.

8. Eric Anderson and Dave Patterson. Extensible, Scalable Monitoring for Clusters of Computers. 
Proceedings of the 11th Systems Administration Conference (LISA '97), October 26-31, 1997,
San Diego, California, USA.

9. Putchong Uthayopas, Chaiyapom Jaikaew, and Thitiwan Srinak. Interactive Management of 
Workstation Clusters Using World Wide Web. Cluster Computing Conference-CCC '97, Web 
Proceedings, http://www.mathcs.emory.edu/~ccc97/ (http://www.eng.ku.ac.th/~pu/smile/), 1997.

10. Sun Microsystems. Solstice SyMON 1.1 User’s Guide, Palo Alto, CA 1996. 

11. C. Roder, T. Ludwig, and A. Bode. Flexible Status Measurement in Heterogeneous Environment. 
Proceedings of the Fifth International Conference on Parallel and Distributed Processing, 
Techniques and Applications (PDPTA'98), CSREA Publishers, Las Vegas, USA, 1998.





e-business 


WHAT’S THE DIFFERENCE BETWEEN A LITTLE KID WITH A WEB SITE AND
A MAJOR CORPORATION WITH ONE? NOTHING. THAT’S THE PROBLEM.



Building a publishing-only Web site is the first
step to becoming an e-business. A step that most
businesses (and a lot of little kids) have already
taken. That’s fine as far as it goes - it’s a very
cost-efficient way to distribute basic information.

But the real payoff (for businesses, at least) 
comes with steps two and three. Step two is moving 
to “self-service” Web sites - where customers can
do things like check the status of an account or 
trace the status of a package delivery online. 

Step three is moving to transaction-based 
Web sites - not just buying and selling, but all 
processes that require a dynamic and interactive 
flow of information.

IBM has already helped thousands of companies
use the Web to make the leap from being
a business with a Web site to being an e-business 
- putting their core processes online to improve 
service, cut costs or to actually sell things. 

For example, we helped Charles Schwab Web-enable
their brokerage systems for online trading
and customer service. Since opening, Schwab’s 
Web service has generated over one million online 
accounts totalling over $100 billion in assets. 

e-business economics are compelling. According
to a recent Booz-Allen & Hamilton study in the
US, a traditional bank transaction costs $1.39; the 
same transaction over the Web costs about one 
cent. Similarly a traditional airline ticket costs $12 
to process in the US; an e-ticket costs just $1.50. 
Customers love the convenience. Management loves 
the lower costs. 

IBM solutions have already helped thousands of 
businesses become e-businesses. To find out more 
about the latest IBM solutions visit www.ibm.com.au 
or call us today on 132 426 and ask for e-business/info.



Solutions for a small planet



Ask IBM why the new 
Lotus Domino GO Web 
Server Pro with its 
advanced Web site design 
management tools, 
security capabilities and 
scalable design may well 
be the ultimate on-ramp 
to the Internet. 



Ask IBM how you can
get more from your
investment. Lotus
Domino, Net.Commerce,
VisualAge for Java and
other IBM e-business
solutions run on popular
platforms including
Windows NT.


www.ibm.com.au

You BECOME an 
e-business by connecting 
your traditional IT 
systems to the Web. 

You DO e-business by 
taking the information 
in those systems and 
deploying it in new ways 
over the Web. 


IBM. Solutions for a small planet, the e-business logo and other marks designated * or ™ are trademarks of International Business Machines Corporation in the United States and/or other 
countries. Windows NT is a trademark of Microsoft Corp. Lotus and Domino are trademarks of Lotus Development Corp. in the United States and/or other countries. Java is a trademark of 
Sun Microsystems. Other company's product and/or service names may be trademarks or service marks of their respective companies. ©1998 IBM Australia Ltd. ACN 000 024 733. 







